I'm given a set of transaction data gathered over 3 years. I am required to count the number of transactions that occur each month and identify which months (month and year) have more than 300 transactions.
I tried using this, but I don't know how else I can do it.
Can you help me please?
The attached image has an example of the data I want to process:
df['Transaction_date'].value_counts()
You need to preprocess your data further so you can group by month and year. You would need to provide more information in the question for my answer to be specific, so my answer is general so far:
# extract year and month from the datetime column
df['year'] = df['Transaction_date'].dt.year
df['month'] = df['Transaction_date'].dt.month
# count the transactions in each (year, month) group
df.groupby(['year', 'month']).size()
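Since the question also asks which months exceed 300 transactions, here is a minimal sketch of the full flow (assuming Transaction_date may still need parsing to datetime):

import pandas as pd

# parse the column first if it is still stored as strings (an assumption)
df['Transaction_date'] = pd.to_datetime(df['Transaction_date'])
df['year'] = df['Transaction_date'].dt.year
df['month'] = df['Transaction_date'].dt.month

counts = df.groupby(['year', 'month']).size()
# keep only the (year, month) pairs with more than 300 transactions
print(counts[counts > 300])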
I have an issue with Python not filling in the months of the latest observed year. Does anyone know how to extend the code below so I get values for all the months of the whole year 2021?
I understand that our df does not extend beyond 2021 and that ffill does exactly what I asked it to do. But is there any way to lengthen the index by adding something that tells it to?
# to monthly scores
esg_data = esg_data.set_index("datadate")
# resample each company's series to month-end and forward-fill
esg_data = esg_data.groupby("conm").resample('M').ffill()
esg_data = esg_data.reset_index(level=0, drop=True).reset_index()
The output of the code can be found here:
Screenshot of the output
Thanks in advance
Angela
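A minimal sketch of one way to get there (the pad_to_2021 helper is hypothetical; it assumes each company's monthly series should be extended through 2021-12 and forward-filled):

import pandas as pd

def pad_to_2021(group):
    # extend the monthly index through 2021-12-31, forward-filling the new rows
    full_index = pd.date_range(group.index.min(), '2021-12-31',
                               freq='M', name=group.index.name)
    return group.reindex(full_index).ffill()

esg_data = esg_data.set_index("datadate")
esg_data = esg_data.groupby("conm").resample('M').ffill()
esg_data = esg_data.reset_index(level=0, drop=True)
esg_data = esg_data.groupby("conm", group_keys=False).apply(pad_to_2021)
esg_data = esg_data.reset_index()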
This is, I think, a rather simple question to which I have not been able to find a proper answer.
I have a pandas dataframe with the following characteristics:
frame.shape
Out[117]: (3652, 2)
Here 3652 refers to the days within a decade (3652 rather than 3650, since the decade contains 2 leap years).
I would like to add a third column that holds the date range between 2035-01-01 and 2044-12-31.
Many thanks
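A minimal sketch of one way to do this with pd.date_range (note the caveat in the comments; truncating to match is an assumption):

import pandas as pd

dates = pd.date_range(start='2035-01-01', end='2044-12-31', freq='D')
# caveat: this range has 3653 entries (2036, 2040 and 2044 are leap years),
# one more than the 3652 rows above, so the lengths must be reconciled;
# here we simply truncate to the frame's length
frame['date'] = dates[:len(frame)]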
I'm trying to find the maximum rainfall value for each season (DJF, MAM, JJA, SON) over a 10-year period. I am using netCDF data and xarray to try to do this. The data consists of rainfall (recorded every 3 hours), lat, and lon data. Right now I have the following code:
ds.groupby('time.season').max('time')
However, when I do it this way, the output has a shape of (4, 145, 192), indicating that it's taking the maximum value for each season over the entire period. I would like the maximum for each individual season in every year. In other words, the output should have a shape like (40, 145, 192) (4 values for each year × 10 years).
I've also looked into doing this with Dataset.resample, using time='3M' as the frequency, but then it doesn't split the months up correctly. If I have to, I can alter the dataset so it starts in the correct place, but I was hoping there would be an easier way, considering there's already a function to group it correctly.
Thanks, and let me know if you need any more details!
Resample is going to be the easiest tool for this job. You are close with the time frequency but you probably want to use the quarterly frequency with an offset:
ds.resample(time='QS-Mar').max('time')
These offsets can be further configured as described in the Pandas documentation: http://pandas.pydata.org/pandas-docs/stable/timeseries.html#offset-aliases
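A quick self-contained check of this on synthetic data (the precip variable, dates, and grid are all made up for illustration):

import numpy as np
import pandas as pd
import xarray as xr

# ten years of 3-hourly rainfall at a single toy point
time = pd.date_range('2001-01-01', '2010-12-31 21:00', freq='3h')
ds = xr.Dataset({'precip': (('time',), np.random.rand(len(time)))},
                coords={'time': time})

# one maximum per season per year: roughly 4 bins x 10 years,
# plus partial DJF bins at the start and end of the record
seasonal_max = ds.resample(time='QS-Mar').max('time')
print(seasonal_max.sizes)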
I have a long time series, eg.
import pandas as pd

index = pd.date_range(start='2012-11-05', end='2012-11-10', freq='1S').tz_localize('Europe/Berlin')
df = pd.DataFrame(range(len(index)), index=index, columns=['Number'])
Now I want to extract all sub-DataFrames for each day, to get the following output:
df_2012-11-05: data frame with all data referring to day 2012-11-05
df_2012-11-06: etc.
df_2012-11-07
df_2012-11-08
df_2012-11-09
df_2012-11-10
What is the most efficient way to do this, avoiding a check like index.date == given_date, which is very slow? Also, the user does not know a priori the range of days in the frame.
Any hint on how to do this with an iterator?
My current solution is this, but it is not so elegant and has the two issues described below:
import numpy as np
import pandas as pd

time_zone = 'Europe/Berlin'

# find all days
a = np.unique(df.index.date)  # this can take a lot of time
a.sort()

results = []
for i in range(len(a) - 1):
    day_now = pd.Timestamp(a[i]).tz_localize(time_zone)
    day_next = pd.Timestamp(a[i + 1]).tz_localize(time_zone)
    results.append(df[day_now:day_next])  # how to select if I do not want day_next included?

# last day
results.append(df[day_next:])
This approach has the following problems:
a = np.unique(df.index.date) can take a lot of time
df[day_now:day_next] includes day_next, but I need to exclude it from the range
If you want to group by date (AKA: year+month+day), then use df.index.date:
result = [group[1] for group in df.groupby(df.index.date)]
Note that df.index.day uses the day of the month (i.e., 1 to 31) for grouping, which could result in undesirable behavior if the input dataframe's dates extend across multiple months.
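For example, with the df built in the question, a minimal sketch collecting one frame per day into a dict (daily_frames is just an illustrative name):

# one sub-DataFrame per calendar date, keyed by a datetime.date object
daily_frames = {date: group for date, group in df.groupby(df.index.date)}
# daily_frames[datetime.date(2012, 11, 5)] then holds every row from 2012-11-05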
Perhaps groupby?
DFList = []
for group in df.groupby(df.index.day):
    DFList.append(group[1])
Should give you a list of data frames where each data frame is one day of data.
Or in one line:
DFList = [group[1] for group in df.groupby(df.index.day)]
Gotta love python!
I am trying to convert a dataframe column with a date and timestamp to a year-weeknumber format, i.e., 01-05-2017 03:44 = 2017-1. This is pretty easy; however, I am stuck on dates that fall in a new year but whose week number still belongs to the last week of the previous year. The same thing happens here.
I did the following:
df['WEEK_NUMBER'] = df.date.dt.year.astype(str).str.cat(df.date.dt.week.astype(str), sep='-')
where df['date'] is a very large column of dates and times spanning multiple years.
A date that causes a problem is, for example:
Timestamp('2017-01-01 02:11:27')
The output for my code will be 2017-52, while it should be 2016-52. Since the data covers multiple years, and weeknumbers and their corresponding dates change every year, I cannot simply subtract a few days.
Does anybody have an idea of how to fix this? Thanks!
Replace df.date.dt.year with this:
(df.date.dt.year - ((df.date.dt.week > 50) & (df.date.dt.month == 1)))
Basically, it means that you will subtract 1 from the year value if the week number is greater than 50 and the month is January.
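An alternative sketch using the ISO calendar (dt.isocalendar() needs pandas >= 1.1; note that dt.week is deprecated in recent pandas anyway), which pairs each week with its matching ISO year directly:

# isocalendar() returns year/week/day columns that are consistent with each other
iso = df['date'].dt.isocalendar()
df['WEEK_NUMBER'] = iso['year'].astype(str) + '-' + iso['week'].astype(str)
# e.g. Timestamp('2017-01-01 02:11:27') -> '2016-52'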