I currently have a dataframe with sales data, named "visitresult_and_outcome".
I have a column named "DATEONLY" that holds the sale date (format yyyy-mm-dd) in string format.
I now want to make two new dataframes: one for the sales made on weekends and one for the sales made on weekdays. How can I do this efficiently?
df['dayofweek'] = pd.to_datetime(df['DATEONLY']).dt.dayofweek
This will pull the day of the week out of your date attributes. Creating your other dataframes will just be a matter of slicing.
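The slicing step could look like the minimal sketch below. The sample data and the names weekend_sales and weekday_sales are made up for illustration. Because DATEONLY holds strings, it is parsed with pd.to_datetime before the .dt accessor is used. pandas numbers the days Monday=0 through Sunday=6, so 5 and 6 are the weekend.

```python
import pandas as pd

# Hypothetical sample data standing in for visitresult_and_outcome
df = pd.DataFrame({
    "DATEONLY": ["2019-03-01", "2019-03-02", "2019-03-03", "2019-03-04"],
    "amount": [10, 20, 30, 40],
})

# DATEONLY holds strings, so parse it before using the .dt accessor
df["dayofweek"] = pd.to_datetime(df["DATEONLY"], format="%Y-%m-%d").dt.dayofweek

# Monday=0 ... Sunday=6, so values 5 and 6 mark the weekend
is_weekend = df["dayofweek"] >= 5
weekend_sales = df[is_weekend]
weekday_sales = df[~is_weekend]
```

Building the boolean mask once and negating it guarantees that every row ends up in exactly one of the two dataframes.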
I have a dataset ranging from 2009 to 2019. The Dates include Years, months and days. I have two columns: one with dates and the other with values. I need to group my Dataframe monthly summing up the Values in the other column. At the moment what I am doing is setting the date column as index and using "df.resample('M').sum()".
The problem is that this is grouping my Dataframe monthly but for each different year (so I have 128 values in the "date" column). How can I group my data only for the 12 months without taking into consideration years?
Thank you very much in advance
I attached two images as examples: the Dataframe I have and the Dataframe I want to obtain. [images not included]
Use dt.month on your date column. For example:
df.groupby(df['date'].dt.month).agg({'value':'sum'})
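Here is a small self-contained sketch of that idea. The column names date and value and the sample numbers are assumptions, since the original data was only shown as images. Rows from different years that fall in the same calendar month end up in the same group.

```python
import pandas as pd

# Hypothetical data spanning two years
df = pd.DataFrame({
    "date": pd.to_datetime(["2009-01-15", "2009-02-10", "2010-01-20", "2010-02-05"]),
    "value": [1, 2, 3, 4],
})

# Group on the month number only (1-12), ignoring the year
monthly = df.groupby(df["date"].dt.month)["value"].sum()
```

If the dates are already the index, as in the question, grouping on df.index.month should work the same way.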
I have a dataframe with a date column (type datetime). I can easily extract the year or the month to perform groupings, but I can't find a way to extract both year and month at the same time from a date. I need to analyze performance of a product over a 1 year period and make a graph with how it performed each month. Naturally I can't just group by month because it will add the same months for 2 different years, and grouping by year doesn't produce my desired results because I need to look at performance monthly.
I've been looking at several solutions, but none of them have worked so far.
So basically, my current dates look like this
2018-07-20
2018-08-20
2018-08-21
2018-10-11
2019-07-20
2019-08-21
And I'd just like to have 2018-07, 2018-08, 2018-10, and so on.
You can use to_period:
df['month_year'] = df['date'].dt.to_period('M')
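As a rough sketch of how this plays out on the dates from the question (the sales column and its numbers are invented for illustration), grouping on the period column keeps 2018-07 and 2019-07 apart:

```python
import pandas as pd

df = pd.DataFrame({
    "date": pd.to_datetime(["2018-07-20", "2018-08-20", "2018-08-21",
                            "2018-10-11", "2019-07-20", "2019-08-21"]),
    "sales": [1, 2, 3, 4, 5, 6],  # hypothetical values
})

# One monthly Period per row, e.g. Period('2018-07', 'M')
df["month_year"] = df["date"].dt.to_period("M")

# Sum per year-month; the same month in different years stays separate
per_month = df.groupby("month_year")["sales"].sum()

# String labels such as '2018-07', handy for plot axes
labels = per_month.index.astype(str).tolist()
```

Periods sort chronologically, so the grouped result is already in the right order for a month-by-month graph.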
If they are stored as datetime you should be able to create a string with just the year and month to group by using datetime.strftime (https://strftime.org/).
It would look something like:
df['ym-date'] = df['date'].dt.strftime('%Y-%m')
If you have some data that uses datetime values, like this:
import numpy as np
import pandas as pd

# a weekly date column and a random quantity column of the same length
data = [
    pd.date_range('2017', freq='W', periods=121).to_series().reset_index(drop=True).rename('Sale Date'),
    pd.Series(np.random.normal(1000, 100, 121)).rename('Quantity')
]
sales = pd.concat(data, axis='columns')
You can group by year and month simultaneously like this:
d = sales['Sale Date']
sales.groupby([d.dt.year.rename('Year'), d.dt.month.rename('Month')])['Quantity'].sum()
You can also create a string that represents the combination of month and year and group by that:
ym_id = d.apply("{:%Y-%m}".format).rename('Sale Month')
sales.groupby(ym_id)['Quantity'].sum()
There are a couple of options. One is to map each date to the first of its month.
Assuming your dates are in a column called 'Date', something like:
df['Date_no_day'] = df['Date'].apply(lambda x: x.replace(day=1))
If you are really keen on storing only the year and month, you could map to a (year, month) tuple, e.g.:
df['Date_no_day'] = df['Date'].apply(lambda x: (x.year, x.month))
From here, you can group and aggregate by this new column and perform your analysis.
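The aggregation step might look like this sketch, using the first-of-month variant. The Sales column and its values are assumptions made for the example.

```python
import pandas as pd

df = pd.DataFrame({
    "Date": pd.to_datetime(["2018-07-20", "2018-08-20", "2018-08-21", "2019-07-20"]),
    "Sales": [10, 20, 30, 40],  # hypothetical values
})

# Collapse every date to the first day of its own month
df["Date_no_day"] = df["Date"].apply(lambda x: x.replace(day=1))

# Aggregate on the collapsed date; year and month are both preserved
monthly_total = df.groupby("Date_no_day")["Sales"].sum()
```

Keeping the key as a real timestamp rather than a tuple means the result can still be plotted on a time axis without extra conversion.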
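One way could be to map every date in the column to a month start and then run your analysis month to month. Note that adding pd.offsets.MonthBegin(1) rolls each date forward to the first day of the following month, not back to the first of its own month: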
date_col = pd.to_datetime(['2011-09-30', '2012-02-28'])
new_col = date_col + pd.offsets.MonthBegin(1)
Because every date within a month maps to the same month start, your analysis stays monthly.
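To make the direction of the offset concrete, here is a short sketch on the same two dates. The variable names next_month_start and same_month_start are made up, and the to_period round trip is an alternative for when the first of the same month is wanted, not part of the answer above.

```python
import pandas as pd

date_col = pd.to_datetime(["2011-09-30", "2012-02-28"])

# Adding MonthBegin(1) rolls forward to the next month start
next_month_start = date_col + pd.offsets.MonthBegin(1)

# Alternative: snap back to the start of each date's own month
same_month_start = date_col.to_period("M").to_timestamp()
```

Either key groups the rows by month consistently; they differ only in which month label each group carries.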
I have a pandas dataframe with a datetime index (30 min frequency), and I want to remove the last "n" days from it. My dataframe does not include weekends, so if its last day is a Monday, I want to remove Monday, Friday and Thursday (counting from the end). So I mean observed days, not calendar days. What is the most pythonic way to do it?
Thanks.
Pandas knows about Monday to Friday as business days.
So if you want to remove the last n business days from your dataframe, you can just do:
df.drop(df[df.index >= df.index.max().date()-pd.offsets.BDay(n-1)].index, inplace=True)
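The sketch below walks through the same idea on made-up data so the cutoff is visible. It is a variant, not the exact one-liner: it uses normalize() instead of .date() to get midnight of the last day and keeps rows with a boolean mask instead of calling drop. The index is built to end on a Monday, as in the question.

```python
import pandas as pd

# Hypothetical 30-minute index, Monday 2020-01-06 to Monday 2020-01-13, weekends removed
idx = pd.date_range("2020-01-06 09:00", "2020-01-13 17:00", freq="30min")
idx = idx[idx.dayofweek < 5]
df = pd.DataFrame({"value": range(len(idx))}, index=idx)

n = 3
# Midnight of the last day, moved back n-1 business days: Mon -> Fri -> Thu
cutoff = df.index.max().normalize() - pd.offsets.BDay(n - 1)

# Keep everything strictly before the cutoff day
trimmed = df[df.index < cutoff]
```

This only matches observed days when every business day is present in the data; a holiday gap would make it remove fewer observed days than intended.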
If you really need to remove the days actually observed in the dataframe, it will be slightly more complex because you will have to count the days. The code could be (using a companion dataframe called df_days):
# create a dataframe with the same index and only one row per day:
df_days = pd.DataFrame(index=df.index).assign(day=df.index.date).drop_duplicates('day')
# number the observed days in the companion dataframe (1, 2, 3, ...)
df_days['new_day'] = 1
df_days['days'] = df_days['new_day'].cumsum()
# first timestamp of the earliest of the last n observed days
ix = df_days.loc[df_days['days'] == df_days['days'].max() + 1 - n].index[0]
# drop the last n observed days in place (>= so the row at ix is removed too)
df.drop(df.loc[df.index >= ix].index, inplace=True)
# delete the companion dataframe
del df_days
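A more compact way to count observed days, offered as a sketch rather than a replacement for the code above, is to take the unique normalized dates of the index and cut at the n-th from the end. The function name drop_last_observed_days and the sample data are made up for illustration.

```python
import pandas as pd

def drop_last_observed_days(frame, n):
    """Return frame without the last n distinct dates present in its index."""
    # Distinct dates (at midnight) that actually occur in the data, in order
    days = frame.index.normalize().unique().sort_values()
    if n <= 0:
        return frame
    if n >= len(days):
        return frame.iloc[0:0]  # nothing left, keep the columns
    cutoff = days[-n]  # midnight of the earliest day to remove
    return frame.loc[frame.index < cutoff]

# Hypothetical 30-minute data, Wed 2020-01-01 to Tue 2020-01-14, weekends removed
idx = pd.date_range("2020-01-01 09:00", "2020-01-14 17:00", freq="30min")
idx = idx[idx.dayofweek < 5]
df = pd.DataFrame({"value": range(len(idx))}, index=idx)

trimmed = drop_last_observed_days(df, 3)
```

Because the cutoff comes from dates that exist in the index, missing days such as holidays are skipped automatically, and the original dataframe is left unmodified.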
My DataFrame has the following format:
I resampled the values on a monthly basis, but the problem is that even though the datetime index starts from 2017-07-08, the Date column after grouping by month and taking the mean starts from 2017-01-31. (There is no data at all in my DataFrame from January 2017 to August 2017.) The data recording started in August 2017.
Could you please give me some insights to understand what is happening?