Can you extract both year AND month from date in Pandas [duplicate] - python

This question already has answers here:
Extracting just Month and Year separately from Pandas Datetime column
(13 answers)
Closed 3 months ago.
I have a dataframe with a date column (type datetime). I can easily extract the year or the month to perform groupings, but I can't find a way to extract both year and month at the same time from a date. I need to analyze the performance of a product over a one-year period and make a graph of how it performed each month. I can't just group by month, because that would combine the same month from two different years, and grouping by year doesn't produce my desired results because I need to look at performance month by month.
I've been looking at several solutions, but none of them have worked so far.
So basically, my current dates look like this
2018-07-20
2018-08-20
2018-08-21
2018-10-11
2019-07-20
2019-08-21
And I'd just like to have 2018-07, 2018-08, 2018-10, and so on.

You can use to_period
df['month_year'] = df['date'].dt.to_period('M')
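For instance, with some made-up data, the resulting Period column groups cleanly, since Period('2018-08') and Period('2019-08') are distinct keys:

```python
import pandas as pd

df = pd.DataFrame({
    'date': pd.to_datetime(['2018-07-20', '2018-08-20', '2018-08-21', '2019-08-21']),
    'sales': [10, 20, 30, 40],
})

# to_period('M') truncates each date to its year-month Period,
# so the same month in different years is never merged together
df['month_year'] = df['date'].dt.to_period('M')
monthly = df.groupby('month_year')['sales'].sum()
```

A Period column also sorts chronologically, which a '%Y-%m' string happens to do as well, but Periods additionally support date arithmetic.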

If they are stored as datetime, you can create a string containing just the year and month to group by, using datetime.strftime (https://strftime.org/).
It would look something like:
df['ym-date'] = df['date'].dt.strftime('%Y-%m')

If you have some data that uses datetime values, like this:
import numpy as np
import pandas as pd

data = [
    pd.date_range('2017', freq='W', periods=121).to_series().reset_index(drop=True).rename('Sale Date'),
    pd.Series(np.random.normal(1000, 100, 121)).rename('Quantity'),
]
sales = pd.concat(data, axis='columns')
You can group by year and date simultaneously like this:
d = sales['Sale Date']
sales.groupby([d.dt.year.rename('Year'), d.dt.month.rename('Month')]).sum()
You can also create a string that represents the combination of month and year and group by that:
ym_id = d.apply("{:%Y-%m}".format).rename('Sale Month')
sales.groupby(ym_id).sum()

A couple of options, one is to map to the first of each month:
Assuming your dates are in a column called 'Date', something like:
df['Date_no_day'] = df['Date'].apply(lambda x: x.replace(day=1))
If you are really keen on storing the year and month only, you could map to a (year, month) tuple, eg:
df['Date_no_day'] = df['Date'].apply(lambda x: (x.year, x.month))
From here, you can groupby/aggregate by this new column and perform your analysis
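A minimal sketch of the first option, with made-up data, showing the groupby/aggregate step:

```python
import pandas as pd

df = pd.DataFrame({
    'Date': pd.to_datetime(['2018-07-20', '2018-08-20', '2018-08-21']),
    'value': [1, 2, 3],
})

# Map every date to the first of its month; all dates within the
# same month of the same year collapse onto a single key
df['Date_no_day'] = df['Date'].apply(lambda x: x.replace(day=1))
grouped = df.groupby('Date_no_day')['value'].sum()
```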

One way could be to transform the column to get the first of month for all of these dates and then run your analysis month to month:
date_col = pd.to_datetime(['2011-09-30', '2012-02-28'])
new_col = date_col + pd.offsets.MonthBegin(1)
Here your analysis remains intact as monthly
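For reference, a small sketch of what that offset produces. Note that MonthBegin(1) rolls each date forward to the first of the *following* month, which still yields exactly one key per calendar month:

```python
import pandas as pd

date_col = pd.to_datetime(['2011-09-30', '2012-02-28'])

# MonthBegin(1) advances each date to the next month start, so
# every date within a given month maps to the same anchor date
new_col = date_col + pd.offsets.MonthBegin(1)
```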

Related

Remove Date Grouping from Data

Looking to clean multiple data sets in a more automated way. The current format has the year in one column, the months as further columns, and the numbers as values.
Below is an example of the current format, the original data has multiple years/months.
Current Format:
Year  Jan  Feb
2022  300  200
Below is an example of how I would like the new format to look like. It combines month and year into one column and transposes the number into another column.
How would I go about doing this in excel or python? Have files with many years and multiple months.
New Format:
Date     Number
2022-01  300
2022-02  200
Check the solution below. You will need to extend month_df to cover all twelve months; the current version only covers the example.
import pandas as pd
df = pd.DataFrame({'Year':[2022],'Jan':[300],'Feb':[200]})
month_df = pd.DataFrame({'Char_Month':['Jan','Feb'], 'Int_Month':['01','02']})
melted_df = pd.melt(df, id_vars=['Year'], value_vars=['Jan', 'Feb'],
                    var_name='Char_Month', value_name='Number')
pd.merge(melted_df, month_df, on='Char_Month')\
  .assign(Date=lambda x: x['Year'].astype(str) + '-' + x['Int_Month'])\
  [['Date', 'Number']]
Output:

Anyway to group or sort dates by month and/or year using python?

Looking at a fruit and veg dataset with prices and dates. However when I try to plot anything with the date there are way too many instances as the date feature does it by each week. Is there anyway to either group the dates by month or something? The date format is like 2022-02-11.
A simple way is to add a month column and group by it. We use pandas.DataFrame.groupby and pandas.DatetimeIndex to do this.
df['Month'] = pd.DatetimeIndex(df.date).month
df.groupby(['Month']).sum().plot()
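If the data spans more than one year, grouping on the month alone will sum the same month from different years together; a sketch of one way around that (assuming a datetime column named date), grouping on year and month at once:

```python
import pandas as pd

df = pd.DataFrame({
    'date': pd.to_datetime(['2021-02-11', '2022-02-11', '2022-03-04']),
    'price': [1.0, 2.0, 4.0],
})

# Group on (year, month) pairs so February 2021 and February 2022
# remain separate groups in the result
d = df['date']
monthly = df.groupby([d.dt.year.rename('Year'), d.dt.month.rename('Month')])['price'].sum()
```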

Python3: How can i select only weekdays from a pandas dataframe?

I currently have a dataframe with sales data, named "visitresult_and_outcome".
I have a column named "DATEONLY" that holds the sale date (format yyyy-mm-dd) in string format.
I now want to make 2 new dataframes: 1 for the sales made in the weekend, 1 for the sales made on weekdays. How can i do this in an efficient way?
df['dayofweek'] = pd.to_datetime(df['DATEONLY']).dt.dayofweek
This will pull the day of the week out of your date attributes. Creating your other dataframes will just be a matter of slicing.
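A sketch of the full split with made-up data, assuming DATEONLY holds strings like '2021-06-05' (so it is parsed to datetime first):

```python
import pandas as pd

df = pd.DataFrame({
    'DATEONLY': ['2021-06-04', '2021-06-05', '2021-06-06', '2021-06-07'],
    'sale': [1, 2, 3, 4],
})

# dayofweek: Monday=0 ... Sunday=6, so >= 5 means Saturday or Sunday
dow = pd.to_datetime(df['DATEONLY']).dt.dayofweek
weekend_sales = df[dow >= 5]
weekday_sales = df[dow < 5]
```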

Changing frequency of pandas Period and PeriodIndex

I am importing some stock data that has annual report information into a pandas DataFrame. But the date for the annual report end date is an odd month (end of january) rather than end of year.
years = ['2017-01-31', '2016-01-31', '2015-01-31']
df = pd.DataFrame(data = years, columns = ['years'])
df
Out[357]:
years
0 2017-01-31
1 2016-01-31
2 2015-01-31
When I try to add in a PeriodIndex which shows the period of time the report data is valid for, it defaults to ending in December rather than inferring it from the date string
df.index = pd.PeriodIndex(df['years'], freq ='A')
df.index
Out[367]: PeriodIndex(['2017', '2016', '2015'], dtype='period[A-DEC]',
name='years', freq='A-DEC')
Note that the frequency should be 'A-JAN'.
I assume this means that the end date can't be inferred from PeriodIndex and the end date string I gave it.
I can change it using the asfreq method and anchored offsets, using "A-JAN" as the frequency string. But this changes all of the individual periods in the PeriodIndex rather than individually, and years can have different reporting end dates for their annual report (in the case of a company that changed its reporting period).
Is there a way to interpret each date string and correctly set each period for each row in my pandas frame?
My end goal is to set a period column or index that has a frequency of 'annual' but with the period end date set to the date from the corresponding row of the years column.
** Expanding this question a bit further. Consider that I have many stocks with 3-4 years of annual financial data for each and all with varying start and end dates for their annual reporting frequencies (or quarterly for that matter).
Out[14]:
years tickers
0 2017-01-31 PG
1 2016-01-31 PG
2 2015-01-31 PG
3 2017-05-31 T
4 2016-05-31 T
5 2015-05-31 T
What I'm trying to get to is a column with proper Period objects that are configured with proper end dates (from the years column), all with annual frequencies. I've thought about iterating through the years and using apply/map or a lambda function with the pd.Period constructor. It may be that a PeriodIndex can't hold Period objects with varying end dates. Something like:
s = []
for row in df.years:
    s.append(pd.Period(row, freq='A'))
df['period'] = s
#KRkirov got me thinking. It appears the Period constructor is not smart enough to set the end date of the frequency by reading the date string. I was able to get the frequency end date right by building up an anchor string from the end date of the reporting period as follows:
# make sure the years column is datetime, then return the month
# as a 3-letter abbreviation (eg. "JAN")
df['years'] = pd.to_datetime(df['years'])
df['offset'] = df['years'].dt.strftime('%b').str.upper()
# now build up an anchor offset string (eg. "A-JAN" )
# for quarterly report (eg. "Q-JAN") for q report ending January for year
df['offset_strings'] = "A" + '-' + df.offset
Anchor strings are documented in the pandas docs here.
And then iterate through the rows of the DataFrame to construct each Period and put it in a list, then add the list of Period objects (which is coerced to a PeriodIndex) to a column.
ps = []
for i, r in df.iterrows():
    p = pd.Period(r['years'], freq=r['offset_strings'])
    ps.append(p)
df['period'] = ps
This returns a proper PeriodIndex with the Period Objects set correctly:
df['period']
Out[40]:
0 2017
1 2016
2 2015
Name: period, dtype: object
df['period'][0]
Out[41]: Period('2017', 'A-JAN')
df.index = df.period
df.index
Out[43]: PeriodIndex(['2017', '2016', '2015'], dtype='period[A-JAN]',
name='period', freq='A-JAN')
Not pretty, but I could not find another way.

Python: Date conversion to year-weeknumber, issue at switch of year

I am trying to convert a dataframe column with a date and timestamp to a year-weeknumber format, i.e., 01-05-2017 03:44 = 2017-1. This is pretty easy, however, I am stuck at dates that are in a new year, yet their weeknumber is still the last week of the previous year. The same thing that happens here.
I did the following:
df['WEEK_NUMBER'] = df.date.dt.year.astype(str).str.cat(df.date.dt.week.astype(str), sep='-')
Where df['date'] is a very large column with date and times, ranging over multiple years.
A date which gives a problem is for example:
Timestamp('2017-01-01 02:11:27')
The output for my code will be 2017-52, while it should be 2016-52. Since the data covers multiple years, and weeknumbers and their corresponding dates change every year, I cannot simply subtract a few days.
Does anybody have an idea of how to fix this? Thanks!
Replace df.date.dt.year by this:
(df.date.dt.year - ((df.date.dt.week > 50) & (df.date.dt.month == 1)))
Basically, it means that you subtract 1 from the year value if the week number is greater than 50 and the month is January.
