This is, I think, a rather simple question to which I have not been able to find a proper answer.
I have a pandas dataframe with the following characteristics:
frame.shape
Out[117]: (3652, 2)
Here 3652 is the number of days in a decade (365*10 + 2, since the decade contains 2 leap years).
I would like to add a third column that holds the date range from 2035-01-01 to 2044-12-31.
Many thanks
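A minimal sketch of one way to do this, assuming the frame has exactly one row per day; using periods=len(frame) sidesteps any off-by-one between the stated range and the actual row count:

import pandas as pd

# One calendar day per row, starting at 2035-01-01; periods=len(frame)
# keeps the new column exactly the same length as the frame
frame['date'] = pd.date_range(start='2035-01-01', periods=len(frame), freq='D')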
example data
I'm given a set of transaction data gathered over 3 years. I am required to count the number of transactions that occur each month and identify which month and year have more than 300 transactions.
I tried the following, but I don't know how else to approach it.
Can you help me, please?
The attached image has an example of the data I want to process.
df['Transaction_date'].value_counts()
You need to do some further preprocessing so that you can group by month and year. You would need to provide more information in the question for the answer to be specific to your case, so this answer is general:
# Extract year and month from the datetime column
df['year'] = df['Transaction_date'].dt.year
df['month'] = df['Transaction_date'].dt.month
# Count transactions per (year, month) pair
df.groupby(['year','month']).size()
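To then identify which months exceed 300 transactions, you can filter the resulting counts. A small sketch, assuming Transaction_date has already been parsed with pd.to_datetime:

counts = df.groupby(['year','month']).size()
# Keep only the (year, month) pairs with more than 300 transactions
busy_months = counts[counts > 300]
print(busy_months)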
I have a pandas dataframe and the index column is time with hourly precision. I want to create a new column that compares the value of the column "Sales number" at each hour with the same exact time one week ago.
I know that it can be written using the shift function:
df['compare'] = df['Sales'] - df['Sales'].shift(7*24)
But I wonder how can I take advantage of the date_time format of the index. I mean, is there any alternatives to using shift(7*24) when the index is in date_time format?
Try something with
df['Sales'].shift(7,freq='D')
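With a DatetimeIndex, passing freq shifts the timestamps rather than the row positions, and the subtraction then aligns on the index, so each hour is compared with the exact same hour one week earlier even if some hours are missing. A minimal sketch, assuming an hourly DatetimeIndex and a 'Sales' column:

# Shift the index forward by 7 days; the subtraction aligns on timestamps
df['compare'] = df['Sales'] - df['Sales'].shift(7, freq='D')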
I have a dataframe that looks like this
I need to adjust the time_in_weeks column for the entry with the value 34. When there is a duplicate uniqueid with a different rma_created_date, that means some failure occurred. The 34 needs to be recalculated as the number of weeks between the most recent rma_created_date (2020-10-15 in this case) and the rma_processed_date of the row above (2020-06-28).
I hope that makes sense in terms of what I am trying to do.
So far I did this
def clean_df(df):
    '''
    This function fixes the time_in_weeks column to calculate the correct
    number of weeks when there are multiple failures for an item.
    '''
    # Sort by rma_created_date
    df = df.sort_values(by=['rma_created_date'])
    return df
Now I need to perform what I described above, but I am a little confused about how to do this, especially considering we could have multiple failures and not just two.
I should get something like this returned as output
As you can see, the 34 was changed to the number of weeks between 2020-10-15 and 2020-06-26.
Here is another example with more rows
Using the expression suggested
df['time_in_weeks'] = np.where(
    df.uniqueid.duplicated(keep='first'),
    df.rma_processed_date.dt.isocalendar().week.sub(
        df.rma_processed_date.dt.isocalendar().week.shift(1)),
    df.time_in_weeks)
I get this
Final note: if there is a date of 1/1/1900 then don't perform any calculation.
The question is not very clear; happy to correct if I interpreted it wrongly.
Try np.where(condition, value_if_true, value_if_false):
#Coerce dates into datetime
df['rma_processed_date'] = pd.to_datetime(df['rma_processed_date'])
df['rma_created_date'] = pd.to_datetime(df['rma_created_date'])
#Solution: for duplicated uniqueids, take the number of weeks between this
#row's rma_created_date and the previous row's rma_processed_date
df['time_in_weeks'] = np.where(
    df.uniqueid.duplicated(keep='first'),
    df.rma_created_date.sub(df.rma_processed_date.shift(1)).dt.days // 7,
    df.time_in_weeks)
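Since an item can fail more than twice, and rows dated 1/1/1900 should be left alone, here is a fuller sketch; the groupby-shift and the 1900 guard are assumptions layered on top of the answer above, not part of it:

import numpy as np
import pandas as pd

df = df.sort_values(['uniqueid', 'rma_created_date'])

# Previous rma_processed_date within each uniqueid, so any number of
# repeated failures is handled, not just two
prev_processed = df.groupby('uniqueid')['rma_processed_date'].shift(1)

# Recalculate only repeated failures, skipping the 1/1/1900 sentinel dates
mask = prev_processed.notna() & (df['rma_created_date'] != pd.Timestamp('1900-01-01'))
weeks = (df['rma_created_date'] - prev_processed).dt.days // 7
df['time_in_weeks'] = np.where(mask, weeks, df['time_in_weeks'])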
I'm trying to convert the column Year (type: int64) into a date type so that I can use the Groupby function to group by decade.
I'm using the following code to convert the datatype:
import datetime as dt
crime["Date"]=pd.TimedeltaIndex(crime["Year"], unit='d')+dt.datetime(1960,1,1)
crime[["Year","Date"]].head(10)
Screenshot of output
The date it returns is not correct: it doesn't start at the correct year, and the day increases with each row.
I want the year to start at 1960, and for each row the year to increase by 1.
I tried substituting unit='d' in the code above with unit='y' and I get the following result:
ValueError: Units 'M' and 'Y' are no longer supported, as they do not represent unambiguous timedelta value durations.
I think @kate's answer is what you want; I wrote mine before that one came along. I thought it might still be worth keeping to explain why unit='y' isn't supported, and why unit='d' isn't working for you either...
I wouldn't think this would be right:
TimedeltaIndex(crime["Year"], unit='d')
as I expect this interprets your year count as a count of days. There is a good reason you can't use unit='y': years don't all have the same number of days, so a count of years is ambiguous as a duration; it only makes exact sense once it is anchored to an actual starting year.
The same holds true, even more so, for months, since months have a variety of day counts, so a timedelta in months has no well-defined meaning.
I would add the column in the following way:
# Build a real datetime; note this assumes Year is an offset from 1960
# (matching the original TimedeltaIndex attempt), not a calendar year
crime['Date'] = crime['Year'].map(lambda x: dt.datetime(1960 + x, 1, 1))
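Since the stated goal is grouping by decade, one more step may help; a small sketch building on the Date column above:

# Integer arithmetic maps each year to the start of its decade, e.g. 1967 -> 1960
crime['decade'] = crime['Date'].dt.year // 10 * 10
crime.groupby('decade').size()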
I have a dataframe that has Date as its index. The dataframe holds stock market data, so the dates are not continuous. If I want to move, let's say, 120 rows up in the dataframe, how do I do that? For example:
If I want to get the data starting from 120 trading days before the start of the year 2018, how do I do that instead of the below:
df['2018-01-01':'2019-12-31']
Thanks
Try this:
df.iloc[df.index.get_loc('2018-01-01'):df.index.get_loc('2019-12-31')]
Get the location of both dates in the index and slice between them to get the desired rows.
UPDATE:
Based on your requirement, here are some small modifications of the above.
Yearly Indexing
>>> df.iloc[df.index.get_loc('2018').start:df.index.get_loc('2019').stop]
Above, df.index.get_loc('2018') returns a slice object covering all rows of 2018; its .start attribute gives the position of the first row of 2018, and .stop on the 2019 slice gives the position just past the last row of 2019.
Monthly Indexing
Now suppose you want data for the first 6 months of 2018 (without knowing the first trading day); the same can be done using:
>>> df.iloc[df.index.get_loc('2018-01').start:df.index.get_loc('2018-06').stop]
As you can see, this indexes the first 6 months of 2018 using the same logic.
Assuming you are using pandas and the dataframe is sorted by dates, a very simple way would be:
# Assumes a default integer index and a 'dates' column sorted ascending
initial_date = '2018-01-01'
initial_date_index = df.loc[df['dates'] == initial_date].index[0]
# Step back 120 rows (trading days) from that position
offset = 120
start_index = initial_date_index - offset
new_df = df.loc[start_index:]
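With Date as the actual index, as in the question, the same idea can be written positionally; a small sketch, assuming the index is sorted:

# Position of the first trading day of 2018: a partial-string lookup on a
# DatetimeIndex returns a slice, and .start is the first matching position
start_2018 = df.index.get_loc('2018').start
# Step back 120 trading days, clamping at the start of the frame
new_df = df.iloc[max(start_2018 - 120, 0):]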