pandas groupby offsets different start

pandas groupby offsets different start - python

I have a simple offset question that I cannot seem to find the answer for in the other previous posts. I am trying to groupby weeks, but the default df.groupby(pd.TimeGrouper('1W')) gives me the groupby starting on Sunday.
Say for instance I want this groupby to start on Tuesday. I tried to naively add pd.DateOffset(days=2) as an additional argument but that did not seem to work.

Offset strings can include a component that specifies when the type of period should start.
In your case, you want W-Tue
df.groupby(pd.TimeGrouper('W-Tue'))

Related

Resample().mean() in Python/Pandas and adding the results to my dataframe when the starting point is missing

I'm pretty new to coding and have a problem resampling my dataframe with Pandas. I need to resample my data ("value") to means for every 10 minutes (13:30, 13:40, etc.). The problem is: The data start around 13:36 and I can't access them by hand because I need to do this for 143 dataframes. Resampling adds the mean at the respective index (e.g. 13:40 for the second value), but because 13:30 is not part of my indices, that value gets lost.
I'm trying two different approaches here: First, I tried every option of resample() (offset, origin, convention, ...). Then I tried adding the missing values manually with a loop, which doesn't run properly because I didn't know how to access the correct spot on the list. The list does include all relevant values though. I also tried adding a row with 13:30 as the index on top of the dataframe but didn't manage to convince Python that my index is legit because it's a timestamp (this is not in the code).
Sorry for the very rough code, it just didn't work in several places which is why I'm asking here.
If you have a possible solution, please keep in mind that it has to function within an already long loop because of the many dataframes I have to work on simultaneously.
Thank you very much!
df["tenminavg"] = df["value"].resample("10Min").mean()
df["tenminavg"] = df["tenminavg"].ffill()
ls1 = df["value"].resample("10Min").mean() #my alternative: list the resampled values in order to eventually access the first relevant timespan
for i in df.index: #this loop doesn't work. It should add the value for the first 10 min
if df["tenminavg"][i]=="nan":
if datetime.time(13,30) <= df.index.time < datetime.time(13,40):
df["tenminavg"][i] = ls1.index.loc[i.floor("10Min")]["value"] #tried to access the corresponding data point in the list
else:
continue

Python / Bloomberg api - historical security data incl. weekends (sat. and sun.)

I'm currently working with an blpapi and trying to get a bdh of an index including weekends. (I'll later need to match this df with another date vector.)
I'm allready using
con.bdh([Index],['PX_LAST'],'19910102', today.strftime('%Y%m%d'), [("periodicitySelection", "DAILY")])
but this will return only weekdays (mon - fr). I know how this works in excel with the bbg function-builder but not sure about the wording within the blpapi.
Since I'll need always the first of each month,
con.bdh([Index],['PX_LAST'],'19910102', today.strftime('%Y%m%d'), [("periodicitySelection", "MONTHLY")])
wont work as well because it will return 28,30,31 and so.
Can anyone help here? THX!

You can use a combination of:
"nonTradingDayFillOption", "ALL_CALENDAR_DAYS" # include all days
"nonTradingDayFillMethod", "PREVIOUS_VALUE" # fill non-trading days with previous value

Python: create new columns from rows based on multiple conditions

I've been poking around a bit and can't see to find a close solution to this one:
I'm trying to transform a dataframe from this:
To this:
Such that remark_code_names with similar denial_amounts are provided new columns based on their corresponding har_id and reason_code_name.
I've tried a few things, including a groupby function, which gets me halfway there.
denials.groupby(['har_id','reason_code_name','denial_amount']).count().reset_index()
But this obviously leaves out the reason_code_names that I need.
Here's a minimum:
pd.DataFrame({'har_id':['A','A','A','A','A','A','A','A','A'],'reason_code_name':[16,16,16,16,16,16,16,22,22],
'remark_code_name':['MA04','N130','N341','N362','N517','N657','N95','MA04','N341'],
'denial_amount':[5402,8507,5402,8507,8507,8507,8507,5402,5402]})

Using groupby() is a good way to go. Use it along with transform() and overwrite the column with name 'remark_code_name. This solution puts all remark_code_names together in the same column.
denials['remark_code_name'] = denials.groupby(['har_id','reason_code_name','denial_amount'])['remark_code_name'].transform(lambda x : ' '.join(x))
denials.drop_duplicates(inplace=True)
If you really need to create each code in their own columns, you could apply another function and use .split(). However you will first need to set the number of columns depending on the max number of codes you find in a single row.

Python - Pandas - Difference between timestamps and period range

I am having troubles understanding the difference between a PeriodIndex and a DateTimeIndex, and when to use which. In particular, it always seemed to be more natural to me to use Periods as opposed to Timestamps, but recently I discovered that Timestamps seem to provide the same indexing capability, can be used with the timegrouper and also work better with Matplotlib's date functionalities. So I am wondering if there is every a reason to use Periods (a PeriodIndex)?

Periods can be use to check if a specific event occurs within a certain period. Basically a Period represents an interval while a Timestamp represents a point in time.
# For example, this will return True since the period is 1Day. This test cannot be done with a Timestamp.
p = pd.Period('2017-06-13')
test = pd.Timestamp('2017-06-13 22:11')
p.start_time < test < p.end_time
I believe the simplest reason for ones to use Periods/Timestamps is whether attributes from a Period and a Timestamp are needed for his/her code.

How to increment a date using Arrow?

I'm using the arrow module to handle datetime objects in Python. If I get current time like this:
now = arrow.now()
...how do I increment it by one day?

Update as of 2020-07-28
Increment the day
now.shift(days=1)
Decrement the day
now.shift(days=-1)
Original Answer
DEPRECATED as of 2019-08-09
https://arrow.readthedocs.io/en/stable/releases.html
0.14.5 (2019-08-09) [CHANGE] Removed deprecated replace shift functionality. Users looking to pass plural properties to the replace function to shift values should use shift instead.
0.9.0 (2016-11-27) [FIX] Separate replace & shift functions
Increment the day
now.replace(days=1)
Decrement the day
now.replace(days=-1)
I highly recommend the docs.

The docs state that shift is to be used for adding offsets:
now.shift(days=1)
The replace method with arguments like days, hours, minutes, etc. seems to work just as shift does, though replace also has day, hour, minute, etc. arguments that replace the value in given field with the provided value.
In any case, I think e.g. now.shift(hours=-1) is much clearer than now.replace.

See documentation
now = arrow.now()
oneDayFromNow = now.replace(days+=1)

Develop Reference

Python is a programming language that lets you work quickly and integrate systems more effectively.

pandas groupby offsets different start - python

Offset strings can include a component that specifies when the type of period should start. In your case, you want W-Tue df.groupby(pd.TimeGrouper('W-Tue'))

Related

Resample().mean() in Python/Pandas and adding the results to my dataframe when the starting point is missing

Python / Bloomberg api - historical security data incl. weekends (sat. and sun.)

Python: create new columns from rows based on multiple conditions

Python - Pandas - Difference between timestamps and period range

How to increment a date using Arrow?

Categories

Resources