Pandas query after groupby and agg function - python

I have the following piece of code already. Basically, I'm grouping by week and then by tenant, then manipulating the data by calculating the p95 and p90. Now I only want to print violators, i.e. rows where p90 > 10 or p95 > 50. I tried using query after manipulating the DataFrame, but it doesn't work: it complains about a KeyError. Is there any way I can run my computations inside the agg function itself? Below is the code I have come up with:
total_logs_percentiles = data.groupby(['version_week', 'tenant']).agg(
    {'time_in_queue': {'p90': p90_agg, 'p95': p95_agg},
     'duration': {'p90': p90_agg, 'p95': p95_agg}})
This is an example data output from the above expression:
                        time_in_queue     duration
                          p90   p95       p90  p95
version_week tenant
1            google       0.9    22        44  0.5
             yahoo         12    21         4  0.5
             bing          0.5    22        5  0.5
             duckduckgo    0.7    23        4  0.5
             IE            25    24        46  0.5
             Edge          60    25        47  0.5
Then I'm running a query to filter the violators like below, but it doesn't work and complains with a KeyError. How do I fix it?
total_logs_percentiles.query('time_in_queue.p90 > 10', engine='python')
How do I fix the above, and can I also optimize this in such a way that the values are filtered out in the agg function itself? p90_agg and p95_agg are functions I already have.
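A minimal sketch of one common fix, assuming the frame printed above: the nested-dict agg produces MultiIndex columns like ('time_in_queue', 'p90'), which query cannot address with dotted names, hence the KeyError. Flattening the columns first makes the query work; note that agg only computes aggregates, so the row filtering has to stay a separate step. (Nested renaming dicts in agg were deprecated and removed in pandas 1.0; named aggregation is the modern replacement.)

# flatten ('time_in_queue', 'p90')-style columns into 'time_in_queue_p90'
total_logs_percentiles.columns = ['_'.join(col) for col in total_logs_percentiles.columns]

# the flattened names are plain identifiers, so query can reference them
violators = total_logs_percentiles.query('time_in_queue_p90 > 10 or time_in_queue_p95 > 50')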

Pandas str.extract returning NaN

I have the following df
Trends Value
2021-12-13T08:00:00.000Z 45
2021-12-13T07:00:00.000Z 32
2021-12-13T06:42:10.000Z 23
2021-12-13T06:27:00.000Z 45
2021-12-10T05:00:00.000Z 23
I ran the following line:
df['Trends'].str.extract('^(.*:[1-9][1-9].*)$', expand=True)
It returns:
0
NaN
NaN
2021-12-13T06:42:10.000Z
2021-12-13T06:27:00.000Z
NaN
My objective is to use the regex to extract any trends that have minutes and seconds greater than zero. The regex works (tested) and the line works too, but what I don't understand is why it returns NaN when there is no match. I looked through several other SO posts and the line is pretty much the same.
My expected outcome:
2021-12-13T06:42:10.000Z
2021-12-13T06:27:00.000Z
Your solution is close; you can get matches with str.match, then filter:
df[df.Trends.str.match('^(.*:[1-9][1-9].*)$')].Trends
output:
2 2021-12-13T06:42:10.000Z
3 2021-12-13T06:27:00.000Z
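As for the NaN values: that is str.extract's documented behaviour rather than a bug; it returns one output row per input row and fills NaN wherever the pattern does not match. If you prefer to stay with str.extract, dropping the non-matches afterwards gives the same result:
df['Trends'].str.extract(r'^(.*:[1-9][1-9].*)$', expand=True).dropna()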
The previous answer won't work with data like the sample below (where the minute is 00 but the second is not, or vice versa), but it will with this updated regex:
df[df.Trends.str.match(r'^(?!.*:00:00\..*)(.*:[0-9]+:[0-9]+\..*)$')].Trends
or
df[df.Trends.str.match(r'^(?!.*:00:00\..*)(.*:.*\..*)$')].Trends
or, if the second doesn't matter but an 01 minute should still be selected:
df[df.Trends.str.match(r'^(?!.*:00:\d+\..*)(.*:.*\..*)$')].Trends
Trends Value
2021-12-13T07:00:00.000Z 32
2021-12-13T07:00:01.000Z 32
2021-12-13T07:00:10.000Z 32
2021-12-13T07:01:00.000Z 32
2021-12-13T07:10:00.000Z 32
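Since these strings are ISO 8601 timestamps, an alternative sketch (not part of the original answers) is to skip the regex entirely: parse the column and filter on the minute and second components.

import pandas as pd

# rebuild the question's sample frame
df = pd.DataFrame({'Trends': ['2021-12-13T08:00:00.000Z',
                              '2021-12-13T07:00:00.000Z',
                              '2021-12-13T06:42:10.000Z',
                              '2021-12-13T06:27:00.000Z',
                              '2021-12-10T05:00:00.000Z']})

s = pd.to_datetime(df['Trends'])                    # parse the ISO timestamps
print(df[(s.dt.minute != 0) | (s.dt.second != 0)])  # keep rows with a nonzero minute or second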

Finding MACD Divergence

I want to create a loop to automate finding MACD divergence under a specific scenario/criterion, but I am finding it difficult to execute, although it's very easy to spot when looking at a chart by eye. Note: you can easily get this as a ready-made scanner, but I want to improve my Python knowledge, and I hope someone will be able to help me with this mission.
My main issue is how to make it reference 40 rows back and test forward; I couldn't get my head around the logic itself.
The rules are as follows: let's say we have the table below.
Date        Price  MACD Hist
04/08/2021  30     1
05/08/2021  29     0.7
06/08/2021  28     0.4
07/08/2021  27     0.1
08/08/2021  26     -0.15
09/08/2021  25     -0.70
10/08/2021  26     -0.1
11/08/2021  27     0.2
12/08/2021  28     0.4
13/08/2021  29     0.5
14/08/2021  30     0.55
15/08/2021  31     0.6
16/08/2021  30     0.55
17/08/2021  29     0.5
18/08/2021  28     0.4225
19/08/2021  27     0.4
20/08/2021  26     0.35
21/08/2021  25     0.3
22/08/2021  24     0.25
23/08/2021  23     0.2
24/08/2021  22     0.15
25/08/2021  21     0.1
26/08/2021  20     0.05
27/08/2021  19     0
28/08/2021  18     -0.05
29/08/2021  17     -0.1
30/08/2021  16     -0.25
I want the code to:
1. Look back 40 days from today and, within those 40 days, find the lowest point reached in MACDHist and the price corresponding to it (i.e. the price of 25 on 09/08/2021 in this example, with a MACDHist of -0.7).
2. Compare that with today's price and MACDHist, and report a divergence (or not) based on these three rules:
- Today's price is lower than the price recorded in point 1 (16 < 25 in this example), AND
- Today's MACDHist is higher than the recorded MACDHist, i.e. smaller in absolute terms (ABS(-0.7) > ABS(-0.20)), AND
- During the same period over which those values were recorded (between 09/08/2021 and today), the MACDHist was positive at least once.
I am sorry if my explanation isn't very clear; the labelled points below (from a chart that accompanied the question) might help illustrate the scenario I am after:
A. The lowest MACDHist in the specified period.
B. Within the same period, the MACDHist was positive at least once.
C. The price is lower than at point A, and the MACDHist is higher than at point A (i.e. lower in absolute terms).
In a similar case I have used backtrader. It's a feature-rich Python framework for backtesting and trading, and you can also use it to generate lots of predefined indicators. In addition, the framework lets you develop your own custom indicator, as shown here. It's very easy to use and supports lots of data formats, including pandas DataFrames. Please take a look!
I found the answer in this great post. It's not a direct implementation, but the logic is the same; by replacing the RSI info with MACDHist you reach the same conclusion:
How to implement RSI Divergence in Python
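To make the three rules concrete, here is a minimal sketch of my own (not taken from the linked post), assuming a DataFrame sorted by date with 'Price' and 'MACDHist' columns as in the question's table:

import pandas as pd

def bullish_divergence(df, lookback=40):
    """Apply the question's three rules to the last `lookback` rows of df."""
    window = df.iloc[-lookback:]           # look-back window ending today
    low_pos = window['MACDHist'].idxmin()  # index label of the lowest MACDHist
    low = window.loc[low_pos]              # the row holding that low
    today = window.iloc[-1]
    since_low = window.loc[low_pos:]       # rows from the low through today
    return (today['Price'] < low['Price']                      # rule 1: price made a lower low
            and abs(today['MACDHist']) < abs(low['MACDHist'])  # rule 2: smaller in absolute terms
            and (since_low['MACDHist'] > 0).any())             # rule 3: positive at least once

On the question's table this returns True: the low is -0.7 on 09/08/2021, today's price 16 < 25, ABS(-0.25) < ABS(-0.7), and the MACDHist turned positive in between.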

Python: How to allocate a 'misc' total between other categories

I'm building a report in Python to automate a lot of manual transformation we currently do in Excel. I'm able to extract the data and pivot it to get something like this:
Date      Category 1  Category 2  Category 3  Misc
01/01/21  40          30          30          10
02/01/21  30          20          50          20
Is it possible to divide the misc total for each date into the other categories by ratio? So I would end up with the below:
Date      Category 1  Category 2  Category 3
01/01/21  44          33          33
02/01/21  36          24          60
The only way I can think of is to split the misc values off to their own table, work out the ratios of the other categories, and then add misc * ratio to each category value, but I just wondered if there was a function I could use to condense the working on this?
Thanks
I think your solution hits the nail on the head; even condensed, it ends up quite dense:
>>> cat = df.filter(regex='Category')
>>> df.update(cat + cat.mul(df['Misc'] / cat.sum(axis=1), axis=0))
>>> df.drop(columns=['Misc'])
Date Category 1 Category 2 Category 3
0 01/01/21 44.0 33.0 33.0
1 02/01/21 36.0 24.0 60.0
cat.mul(df['Misc'] / cat.sum(axis=1), axis=0) gets you the reallocated misc values per row, since each value is multiplied by Misc and divided by the row total. .mul() lets you do the multiplication while specifying which axis to align on; the rest is about selecting the right columns.
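To make the arithmetic visible, here is an expanded step-by-step sketch of the same reallocation (variable names are illustrative):

import pandas as pd

df = pd.DataFrame({'Date': ['01/01/21', '02/01/21'],
                   'Category 1': [40, 30],
                   'Category 2': [30, 20],
                   'Category 3': [30, 50],
                   'Misc': [10, 20]})

cat = df.filter(regex='Category')                      # just the category columns
share = cat.div(cat.sum(axis=1), axis=0)               # each category's share of its row total
df[cat.columns] = cat + share.mul(df['Misc'], axis=0)  # hand out Misc proportionally
df = df.drop(columns='Misc')
print(df)  # Category 1: 44, 36; Category 2: 33, 24; Category 3: 33, 60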

group_by output conversion to data frame issues

So I am not sure if I am taking the best approach to solve this problem, but this is what I have so far:
This is the df that I am working with:
calls.head()
id user_id call_date duration
0 1000_93 1000 2018-12-27 9.0
1 1000_145 1000 2018-12-27 14.0
2 1000_247 1000 2018-12-27 15.0
3 1000_309 1000 2018-12-28 6.0
4 1000_380 1000 2018-12-30 5.0
I am trying to figure out how to create a data frame that tells me how many times a user made a call in a month. This is the code I used to generate that:
calls_per_month = calls.groupby(['user_id',calls['call_date'].dt.month])['call_date'].count()
calls_per_month.head(10)
user_id call_date
1000 12 16
1001 8 27
9 49
10 65
11 64
12 56
1002 10 11
11 55
12 47
1003 12 149
Name: call_date, dtype: int64
Now, the issue is that I need to do further calculations with the user_id attributes of other data frames, so I need to be able to access the total I computed in this table. However, it seems the object I created is not a DataFrame, which is preventing me from doing so. This is a solution I tried:
calls_per_month = calls.groupby(['user_id',calls['call_date'].dt.month])['call_date'].count().reset_index()
#(calls_per_month.to_frame()).columns = ['user_id','date','total_calls']
calls_per_month.columns = ['user_id','date','total_calls']
(I tried with and without to_frame)
But I got the following error:
cannot insert call_date, already exists
Please suggest the best way to go about solving this issue. Considering that I have other data frames with user_id and attributes like 'data used', how do I shape this data frame so that I can do computations like total_use = calls['total_calls'] * internet['data_used'] for each user_id?
Thank you.
Use rename to change the level name, so Series.reset_index works correctly:
calls_per_month = (calls.groupby(['user_id',
                                  calls['call_date'].dt.month.rename('month')])['call_date']
                        .count()
                        .reset_index())
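From there, renaming the count column and merging on user_id covers the follow-up computation from the question; here is a sketch with toy stand-ins (the internet frame and its data_used column are the question's hypothetical):

import pandas as pd

# toy stand-ins for the question's frames; the numbers are illustrative
calls_per_month = pd.DataFrame({'user_id': [1000, 1001],
                                'month': [12, 8],
                                'call_date': [16, 27]})
calls_per_month = calls_per_month.rename(columns={'call_date': 'total_calls'})

internet = pd.DataFrame({'user_id': [1000, 1001],
                         'data_used': [2.5, 4.0]})

merged = calls_per_month.merge(internet, on='user_id')  # align rows by user_id
merged['total_use'] = merged['total_calls'] * merged['data_used']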

Pandas timeseries bins and indexing

I have some experimental data collected from a number of samples at set time intervals, in a dataframe organised like so:
Studynumber Time Concentration
1 20 80
1 40 60
1 60 40
2 15 95
2 44 70
2 65 30
Although the time intervals are supposed to be fixed, there is some variation in the data based on when they were actually collected. I want to create bins of the Time column, calculate an 'average' concentration, and then compare the difference between actual concentration and average concentration for each studynumber, at each time.
To do this, I created a column called 'roundtime', then used a groupby to calculate the mean:
data['roundtime']=data['Time'].round(decimals=-1)
meanconc = data.groupby('roundtime')['Concentration'].mean()
This gives a pandas series of the mean concentrations, with roundtime as the index. Then I want to get this back into the main frame to calculate the difference between each actual concentration and the mean concentration:
data['meanconcentration']=meanconc.loc[data['roundtime']].reset_index()['Concentration']
This works for the first 60 or so values, but then returns NaN for each entry, I think because the index of data is longer than the index of meanconcentration.
On the one hand, this looks like an indexing issue - equally, it could be that I'm just approaching this the wrong way. So my question is: a) can this method work? and b) is there another/better way of doing it? All advice welcome!
Use transform to add a column from a groupby aggregation; this creates a Series with its index aligned to the original df, so you can assign it back correctly:
In [4]:
df['meanconcentration'] = df.groupby('roundtime')['Concentration'].transform('mean')
df
Out[4]:
Studynumber Time Concentration roundtime meanconcentration
0 1 20 80 20 87.5
1 1 40 60 40 65.0
2 1 60 40 60 35.0
3 2 15 95 20 87.5
4 2 44 70 40 65.0
5 2 65 30 60 35.0
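From there, the difference the question actually wants is a plain column subtraction (the new column name is illustrative):
df['difference'] = df['Concentration'] - df['meanconcentration']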
