I have a dataset that contains URLs with a publish date (YYYY-MM-DD) and visits. I want to calculate a benchmark (average) of visits for a complete year. The pages were published on different dates, so their weight/contribution should differ, e.g. a page published in August with 10,000 visits should weigh more than a page published in March with 11,000 visits (it has had less time to accumulate them).
Here is my dataset:
First step:
First of all I want to add a column (i.e. time frame) to my dataset that gives the time elapsed since the publish date. For example, if a page was published on 2019-12-10, the expected output is (Dec 2019, 9 months), i.e. (month and year in which the page was published, total months from today).
Second step:
I want to normalize/rescale my data (visits) on the basis of the time frame column calculated in step 1.
How can I calculate the average/benchmark?
For the first step you can use the following code.
Read the dataframe:
import pandas as pd
df = pd.read_csv("your_df.csv")
My example dataframe is as below:
Pub.Dates Type Visits
0 2019-12-10 00:00:00 A 1000
1 2019-12-15 00:00:00 A 5000
2 2018-06-10 00:00:00 B 6000
3 2018-03-04 00:00:00 B 12000
4 2019-02-10 00:00:00 A 3000
For normalizing the date, first define a method to normalize a single date:
from datetime import datetime

def normalize_date(date):  # input: '2019-12-10 00:00:00'
    date_obj = datetime.strptime(date, "%Y-%m-%d %H:%M:%S")  # get datetime object
    date_to_str = date_obj.strftime("%B %Y")                 # 'December 2019'
    diff_date = datetime.now() - date_obj                    # difference from today
    diff_month = int(diff_date.days / 30)                    # convert days to months (approximate)
    normalized_value = date_to_str + ", " + str(diff_month) + " months"
    return normalized_value                                  # 'December 2019, 9 months'
Now apply the above method to all values of the date column:
df['Pub.Dates'] = df['Pub.Dates'].map(normalize_date)
The normalized dataframe will be:
Pub.Dates Type Visits
0 December 2019, 9 months A 1000
1 December 2019, 9 months A 5000
2 June 2018, 27 months B 6000
3 March 2018, 31 months B 12000
4 February 2019, 19 months A 3000
For the second step, if there are multiple records per month you can group by the date column (and any other columns you need) and take the mean:
average_in_visits = df.groupby(["Pub.Dates", "Type"]).mean()
The result will be:
Visits
Pub.Dates Type
December 2019, 9 months A 3000
February 2019, 19 months A 3000
June 2018, 27 months B 6000
March 2018, 31 months B 12000
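As for the benchmark itself (the last part of the question), one possible reading, sketched here as my own suggestion rather than part of the answer above, is to divide each page's visits by the number of months it has been live and average those rates, so newer pages are not penalised. It reuses your_df.csv and the Pub.Dates/Visits column names from the example; the Visits per month column name is just for illustration.

import pandas as pd

df = pd.read_csv("your_df.csv", parse_dates=["Pub.Dates"])

# Whole months each page has been live (clipped to at least 1 to avoid division by zero)
now = pd.Timestamp.now()
months_live = ((now.year - df["Pub.Dates"].dt.year) * 12
               + (now.month - df["Pub.Dates"].dt.month)).clip(lower=1)

# Visits per month of life; the overall benchmark is their mean
df["Visits per month"] = df["Visits"] / months_live
benchmark = df["Visits per month"].mean()
print(benchmark)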
I have a dataframe which contains one column of month and year as a string:
time          index  value
January 2021  y          5
January 2021  v          8
May 2020      y         25
June 2020     Y         13
June 2020     x         11
June 2020     v         10
...
I would like to change the column "time" into datetime format so I can sort the table into chronological order.
Is there any way to do it when the time is a string with a month name and a year?
Edit: when I do
result_Table['time'] = pd.to_datetime(result_Table['time'], format='%Y-%m-%d')
I receive the error:
ValueError: time data January 2021 doesn't match format specified
Sample dataframe:
df=pd.DataFrame({'time':['January 2021','May 2020','June 2020']})
If you want to specify the format parameter then that should be '%B %Y' instead of '%Y-%m-%d':
df['time']=pd.to_datetime(df['time'],format='%B %Y')
# OR you can also simply use:
# df['time'] = pd.to_datetime(df['time'])
output of df:
time
0 2021-01-01
1 2020-05-01
2 2020-06-01
For more info regarding format codes, see the strftime()/strptime() format codes table in the Python datetime documentation: https://docs.python.org/3/library/datetime.html#strftime-and-strptime-format-codes
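Once the column is a real datetime, sorting the table chronologically (the original goal) is just a sort_values call, for example:

df = df.sort_values('time').reset_index(drop=True)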
I would like to group the data based on the months January and February. Here is a sample of the data set that I have:
Date Count
01.01.2019 1
01.02.2019 7
02.01.2019 4
03.01.2019 4
04.01.2019 1
04.02.2019 5
I want to group the data as follows, where Total_Count is the sum of Count for month 1 (Jan) and month 2 (Feb):
Month Total_Count
Jan 10
Feb 12
Cast to datetime, group by the dt.month_name and sum:
(df.groupby(pd.to_datetime(df['Date'], format='%d.%m.%Y')
.dt.month_name()
.str[:3])['Count']
.sum()
.rename_axis('Month')
.reset_index(name='Total_Count'))
Month Total_Count
0 Feb 12
1 Jan 10
To sort the index by month, we could instead do:
s = df.groupby(pd.to_datetime(df['Date'], format='%d.%m.%Y').dt.month)['Count'].sum()
s.index = pd.to_datetime(s.index, format='%m').month_name().str[:3]
s.rename_axis('Month').reset_index(name='Total_Count')
Month Total_Count
0 Jan 10
1 Feb 12
df
Year Month Name Avg
2015 Jan 12
2015 Feb 13.4
2015 Mar 10
...................
2019 Jan 11
2019 Feb 11
Code
df['Month Name-Year']= pd.to_datetime(df['Month Name'].astype(str)+df['Year'].astype(str),format='%b%Y')
In the dataframe df, Avg is a groupby output keyed on Month Name and Year, so Month Name and Year are actually levels of a MultiIndex. I want to create a third column, Month Name-Year, so that I can do some operations (create plots etc.) using the data.
The output I am getting using the code is as below:
Year Month Name Avg Month Name-Year
2015 Jan 12 2015-01-01
2015 Feb 13.4 2015-02-01
2015 Mar 10 2015-03-01
...................
2019 Nov 11 2019-11-01
2019 Dec 11 2019-12-01
and so on.
The output I want in the Month Name-Year column is 2015-Jan, 2015-Feb, etc., or 2015-01, 2015-02, ..., 2019-11, 2019-12, etc. (only year and month, no days).
Please help
One kind of solution converts to datetimes and then changes the format with Series.dt.to_period or Series.dt.strftime:
df['Month Name-Year']=pd.to_datetime(df['Month Name']+df['Year'].astype(str),format='%b%Y')
#for months periods
df['Month Name-Year1'] = df['Month Name-Year'].dt.to_period('m')
#for 2010-02 format
df['Month Name-Year2'] = df['Month Name-Year'].dt.strftime('%Y-%m')
Simplest is a solution without converting to datetimes: just join with '-' and convert the years to strings:
#format 2010-Feb
df['Month Name-Year3'] = df['Year'].astype(str) + '-' + df['Month Name']
...which is the same as converting to datetimes and then converting to custom strings:
#format 2010-Feb
df['Month Name-Year31'] = df['Month Name-Year'].dt.strftime('%Y-%b')
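Note that the question describes Month Name and Year as levels of a MultiIndex produced by a groupby rather than ordinary columns. If that is the case (my assumption about the setup), resetting the index first makes all of the snippets above work unchanged:

df = df.reset_index()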
I use the year and week number to convert to a date for a certain purpose.
It worked well before 2019, but when I tried to import 2019 wk1 data, it went wrong.
2019 wk1 becomes 2019-01-07 ~ 2019-01-13.
But conversely, if I use the date to convert back to year and week, it's correct.
May I know what's wrong with my code? Thanks.
import pandas as pd

d = {'year': [2018, 2018, 2018, 2019, 2019, 2019], 'week': [0, 1, 52, 0, 1, 2]}
df = pd.DataFrame(data=d)
df['date'] = pd.to_datetime(df['year'].map(str) + df['week'].map(str) + '-4', format='%Y%W-%w')
df['yearwk'] = pd.to_datetime(df['date'], format='%Y-%M-%D').dt.strftime('%YW%V')
print(df)
year week date yearwk
0 2018 0 2018-01-04 2018W01
1 2018 1 2018-01-04 2018W01
2 2018 52 2018-12-27 2018W52
3 2019 0 2019-01-03 2019W01
4 2019 1 2019-01-10 2019W02
5 2019 2 2019-01-17 2019W03
I use the given year and week number to convert to a date. Ideally, 2019 wk1 should be between 2018-12-31 and 2019-01-05, but it became 2019-01-06 to 2019-01-13.
Then when I use that date to convert back to year and week, the result is what I expected.
According to strftime() and strptime() Behavior,
%W: Week number of the year (Monday as the first day of the week) as a decimal number. All days in a new year preceding the first Monday are considered to be in week 0.
%w: Weekday as a decimal number, where 0 is Sunday and 6 is Saturday.
You put 4 as %w, indicating you want the date of the Thursday of the week number you provide.
The first week (%W=1) of 2018 starting on a Monday is Jan 1 - Jan 7, so its Thursday is Jan 4.
The first week (%W=1) of 2019 starting on a Monday is Jan 7 - Jan 13, so its Thursday is Jan 10.
For 2018, there's no Thursday the week before, so %W=0 will output the same as %W=1. However, for 2019, there is a Thursday the week before, so %W=0 will output Jan 3.
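A side note on why the round trip disagrees: the yearwk label is built with %V (ISO week numbering) while the date is parsed with %W (Monday-based, week 0 allowed), and the two schemes differ around New Year. If you want the parsed date and the label to use the same numbering, one option (a sketch, assuming Python 3.8+ for datetime.fromisocalendar and pandas 1.1+ for Series.dt.isocalendar; ISO weeks run 1-53, there is no week 0) is to build the dates from the ISO calendar directly:

from datetime import datetime
import pandas as pd

d = {'year': [2018, 2018, 2019, 2019], 'week': [1, 52, 1, 2]}
df = pd.DataFrame(data=d)

# Thursday (ISO weekday 4) of each given ISO year/week
df['date'] = [datetime.fromisocalendar(y, w, 4) for y, w in zip(df['year'], df['week'])]

# Rebuild the label from the ISO calendar of the date; it now matches the input
iso = df['date'].dt.isocalendar()
df['yearwk'] = iso['year'].astype(str) + 'W' + iso['week'].astype(str).str.zfill(2)
print(df)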
I have the following pandas dataframe:
Cost
Year Month ID
2016 1 10 40
2 11 50
2017 4 1 60
The columns Year, Month and ID make up the index. I want to set the values within Month to be the name equivalent (e.g. 1 = Jan, 2 = Feb). I've come up with the following code:
df.rename(index={i: calendar.month_abbr[i] for i in range(1, 13)}, inplace=True)
However, this changes the matching values in every level of the index:
Cost
Year Month ID
2016 Jan 10 40
Feb 11 50
2017 Apr Jan 60 # Jan here is incorrect
I obviously only want to change the values in the Month column. How can I fix this?
Use set_levels:
m = {1: 'Jan', 2: 'Feb', 4: 'Apr'}
df.index = df.index.set_levels(
    df.index.levels[1].to_series().map(m).values, level=1)
print(df)
Cost
Year Month ID
2016 Jan 10 40
Feb 11 50
2017 Apr 1 60
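If you prefer to stay with the rename approach from the question, rename also accepts a level argument so that only the Month level is touched (a sketch assuming the level is named 'Month' as shown):

import calendar

df.rename(index={i: calendar.month_abbr[i] for i in range(1, 13)},
          level='Month', inplace=True)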