Pandas Frequency of subcategories in a GroupBy - python

I have a DataFrame as shown below:
First I would like to get the overall frequencies of the CODE values, call it FREQ, then frequencies of the CODE values within each AXLE group and call it GROUP_FREQ.
I was able to calculate the FREQ column using the below code:
pivot = df[['AXLES','CODE']].pivot(['CODE']).agg(['count','mean','min','max'])
pivot['FREQ']=grouped_df.AXLES['count']/pivot.AXLES['count'].sum()*100`
this provides a nice grouped DataFrame as shown below:
However, I could not figure out how to calculate the frequencies within each AXLE group using this grouped_by DataFrame in the next step.
I tried:
pivot['GROUPFREQ']=pivot['AXLES','mean']['count']/pivot['AXLES','mean']['count'].sum()*100
However, this gives a KeyError: 'count'.
I may be on the wrong path, and what I am trying to achieve may not be done using groupby. I decided to check with the community after spending a couple of hours of trial and errors. I'd be glad if you could let me know what you think.
Thanks!
EDIT:
Reproducible input DataFrame:
,CODE,AXLES
0,0101,5
1,001,4
2,0110111,8
3,010111,7
4,0100,5
5,0101,5
6,0110111,8
7,00111,6
8,00111,6
9,0110111,8
10,0100,5
11,0110011,8
12,01011,6
13,0110111,8
14,0110111,8
15,011011,7
16,011011,7
17,011011,7
18,01011,6
19,01011,6
Desired Output for pivot DataFrame:
CODE,COUNT,AXLES,FREQ,GROUPFREQ
001,1,4,0.05,1.00
00111,2,6,0.1,0.40
0100,2,5,0.1,0.50
0101,2,5,0.1,0.50
01011,3,6,0.15,0.60
010111,1,7,0.05,0.25
0110011,1,8,0.05,0.17
011011,3,7,0.15,0.75
0110111,5,8,0.25,0.83
For the first line of the output:
001 is seen only once in the whole data set (20 records). Thus FREQ = 1/20 = 0.05
When the data is grouped by AXLES, for the AXLES=4 group, 001 is the only record, thus the GROUPFREQ = 1/1 = 1.00. (The same code cannot occur under various AXLE groups, so 001 only needs to be checked for AXLES=4.)

Do you mean:
pivot['FREQ'] = df.groupby('AXLES').CODE.value_counts(normalize=True).reset_index(level=0,drop=True)
Output:
AXLES FREQ
count mean min max
CODE
1 1 4 4 4 1.000000
100 2 5 5 5 0.500000
101 2 5 5 5 0.500000
111 2 6 6 6 0.400000
1011 3 6 6 6 0.600000
10111 1 7 7 7 0.250000
11011 3 7 7 7 0.750000
110011 1 8 8 8 0.166667
110111 5 8 8 8 0.833333

Related

How to calculate percentage while matching columns on video_id?

I am trying to calculate a percentage of users who view a specific video and those who do not. I managed to calculate the total number of videos and also total number of videos viewed by each group.
However, when I try to calculate the percentages it does not work.
I believe I probably need to match the story ids as the columns do not match anymore after calculating, how do I do that?
This is my formula to calculate percentages:
pd.DataFrame(df.status.eq(3).astype(int).groupby(df.story_id).sum() / df['story_id'].value_counts())
However the results do not make sense as I believe that during the calculations the story_id did not match.
For percentage - sum divide by count is possible use mean - solution is simplify:
print (df)
story_id status
0 1 3
1 1 5
2 1 3
3 2 3
4 2 3
5 4 5
6 4 3
7 5 7
df1 = df.status.eq(3).groupby(df.story_id).mean().reset_index(name='perc')
print (df1)
story_id perc
0 1 0.666667
1 2 1.000000
2 4 0.500000
3 5 0.000000

Smoothing Categorical Output

I have a list of outputs obtained from a cow behavior detection model. Even in a video when a cow is laying, often time it identifies as standing and vice versa. In each video frame, a classification result is given by the model and we are appending it into a list. Let's assume after 20 frames, we have a series of output as follows -
behavious_cow_1 = ["stand","stand","stand","stand","lying", "stand","stand", "eating", "stand","stand","stand","stand","lying""stand","stand","stand","stand","stand","stand","lying"]
Out of 20 classification results, we have 4 misclassification; 3 lyings, and 1 eating. However, the whole time the cow was sitting at a place. If the list only contained numerical values like - 1,2,3..., I would have opted for moving average to change the misclassification. Is there any Scipy, Pandas, Numpy function that can smooth the categorical output? I am thinking about taking previous 3 and next 3 values to determine the current category.
I used the following solution -
import scipy.stats
window_length = 7
behave = ["stand","stand","stand","stand","lying","lying", "eating"]
most_freq_val = lambda x: scipy.stats.mode(x)[0][0]
smoothed = [most_freq_val(behave[i:i+window_length]) for i in range(0,len(behave)-window_length+1)]
I tried the solution posted by Hugolmn but it broke at a point. In the rolling mode, the window width is provided by the user (7 here). In a certain width, if more than one values are present in the same number of times, the code does not work. It's more like - you tried to find the statistical mode (most common item) of a list but it got more than one item with the same highest frequency.
I am myself very surprised that a function such as mode() does not work with a rolling window in pandas. However, I still found a decent solution to your problem
First, create a pandas Series with categorical datatype:
df = pd.Series(sample, dtype='category')
Now you can see that df.cat.categories returns the list of categories in your data, and df.cat.codes the codes associated to them. We can use the latter to apply a rolling mode with a width of 7 (3 previous, the value, and the next 3):
df.cat.codes
0 3
1 3
2 3
3 3
4 1
5 3
6 3
7 0
8 3
9 3
10 3
11 3
12 2
13 3
14 3
15 3
16 3
17 3
18 1
dtype: int8
df.cat.codes.rolling(7, center=True, min_periods=0).apply(lambda x: x.mode())
0 3.0
1 3.0
2 3.0
3 3.0
4 3.0
5 3.0
6 3.0
7 3.0
8 3.0
9 3.0
10 3.0
11 3.0
12 3.0
13 3.0
14 3.0
15 3.0
16 3.0
17 3.0
18 3.0
dtype: float64
Finally, you can map the codes to get the strings back:
(df.cat.codes
.rolling(7, center=True, min_periods=0)
.apply(lambda x: x.mode())
.map(dict(enumerate(df.cat.categories)))
)
0 stand
1 stand
2 stand
3 stand
4 stand
5 stand
6 stand
7 stand
8 stand
9 stand
10 stand
11 stand
12 stand
13 stand
14 stand
15 stand
16 stand
17 stand
18 stand
dtype: object
And there you go ! You recovered your strings after applying a rolling mode on their codes !

Insert row in dataframe between every existing row and it must contain information from the previous and next rows

Okay this one I'm a little bit stuck on.
I have a dataframe like this:
time Throttle Vout
0 1056.65785 1 8
1 1056.66255 2 8
2 1056.66785 3 9
3 1056.67330 4 11
4 1056.67840 5 15
and I need to add a row between every existing row - the whole dataset is about 21000 rows. The time should be equal to the time in the next row. Any other columns should have the values of the previous row.
So the outcome would be something like this:
time Throttle Vout
0 1056.65785 1 8
1 1056.66255 1 8 <---- new row
2 1056.66255 2 8
3 1056.66785 2 8 <---- new row
4 1056.66785 3 9
5 1056.67330 3 9 <---- new row
6 1056.67330 4 11
7 1056.67840 4 11 <---- new row
8 1056.67840 5 15
I've looked into df.apply() but not sure where to start
Serge Ballesta answer:
So this works with the test data supplied above. When I test it on a much larger DataFrame I start to see some errors. I originally thought it was something wrong in my PyCharm but testing with a larger dataset in powershell proved otherwise.
Quang Hoang answer:
So this also worked on a small scale but when using a larger dataset it seemed to have quite a few issues with both time and the other columns. I've highlighted some in the image below. The top df is the original and the bottom is the altered.
Valdi_Bo Answer
The additional columns seemed to work well with this but there seems to be an issue with the times columns on larger datasets. I've highlighted some below.
You can use a combination of concat and ffill:
(pd.concat([df, df[['time']].shift(-1)])
.sort_index(kind='mergesort')
.dropna(how='all')
.ffill()
)
Output:
time Throttle Vout
0 1056.65785 1.0 8.0
0 1056.66255 1.0 8.0
1 1056.66255 2.0 8.0
1 1056.66785 2.0 8.0
2 1056.66785 3.0 9.0
2 1056.67330 3.0 9.0
3 1056.67330 4.0 11.0
3 1056.67840 4.0 11.0
4 1056.67840 5.0 15.0
I would build a copy of the dataframe, shift its time column, concatenate it to the original dataframe and sort the result according to time:
df2 = df.copy()
df2['time'] = df['time'].shift(-1)
result =
df2[~df2['time'].isna()].append(df).sort_values('time').reset_index(drop=True)
It gives as expected:
time Throttle Vout
0 1056.65785 1 8
1 1056.66255 1 8
2 1056.66255 2 8
3 1056.66785 2 8
4 1056.66785 3 9
5 1056.67330 3 9
6 1056.67330 4 11
7 1056.67840 4 11
8 1056.67840 5 15
This might look a bit overwhelming, but the idea is that you merge the original dataframe with its copy whose values in Throttle & Vout columns are shifted by 1:
pd.concat([
df,
df.loc[:,'Throttle':].shift(1).combine_first(df)
]).reset_index().loc[1:,].sort_values(['time', 'Throttle'])
First compute an auxiliary DataFrame - a copy of df, with time column
shifted 1 place up and without the last original row:
df2 = df.copy()
df2.time = df2.time.shift(-1)
df2.dropna(inplace=True)
The result, for your input sample, is:
time Throttle Vout
0 1056.66255 1 8
1 1056.66785 2 8
2 1056.67330 3 9
3 1056.67840 4 11
and these are the new rows to insert.
To get a concatenation of these 2 DataFrames, in proper order, run:
df = pd.concat([df, df2], keys=[1, 2]).swaplevel().sort_index().reset_index(drop=True)
To guarantee the proper order of rows, I added to the previous solution:
keys - to add "origin indicators", but they area added as the top
level of the MultiIndex,
swaplevel - swap levels of the MulitiIndex, to provide proper sort
by index (in the next step).

Detecting the outlier from rows by certain column in panda dataframe

I have datasets which measure voltage values in certain column.
I'm looking for elegant way to extract the rows that is deviated from mean value. There are couple of group in "volt_id" and I'd like to have each group create their own mean/std and use them to decide which rows are deviated from each group.
for example, I have original dataset as below.
time volt_id value
0 14 A 300.00
1 15 A 310.00
2 15 B 200.00
3 16 B 210.00
4 17 B 300.00
5 14 C 100.00
6 16 C 110.00
7 20 C 200.00
After the algorithm running, I'd only keep row 4 and 7 which is highly deviated from their groups as below.
time volt_id value
4 17 B 300.00
7 20 C 200.00
I could do this if there is only single group but my codes would be messy and lengthy if do this for multiple groups. I'd appreciate if there's simpler way to do this.
thanks,
You can compute and filter on the zscore on each group using groupby.
Assuming you want only those rows which are 1 or more standard deviations away from mean,
g = df.groupby('volt_id').value
v = (df.value - g.transform('mean')) / g.transform('std')
df[v.abs().ge(1)]
time volt_id value
4 17 B 300.0
7 20 C 200.0
Similar to #COLDSPEED's solution:
In [179]: from scipy.stats import zscore
In [180]: df.loc[df.groupby('volt_id')['value'].transform(zscore) > 1]
Out[180]:
time volt_id value
4 17 B 300.0
7 20 C 200.0
One way to do this would be using outliers:
http://www.mathwords.com/o/outlier.htm
You would need to define your inner quartile range and first and third quartiles. You could then filter your data onsimple comparison.
Quartiles are not the only way to determine outliers howevet. Heres a discussion comparing standard deviation and quartiles for locating outliers:
https://stats.stackexchange.com/questions/175999/determine-outliers-using-iqr-or-standard-deviation

Perform aggregation and transformation in pandas dataframe based on multiple conditions

I have the following dataframe df1.
import pandas as pd
df1=pd.DataFrame([[1,11,'mx212', 1000], [1,11,'rx321', 600],
[1,11,'/bc1', 5],[1,11,'/bc2', 11], [1,12,'sx234', 800],
[1,12,'mx456', 1232], [3,13,'mx322', 1000], [3,13,'/bc3', 34]],
columns=["sale","order","code","amt"])
sale order code amt
0 1 11 mx212 1000
1 1 11 rx321 600
2 1 11 /bc1 5
3 1 11 /bc2 11
4 1 12 sx234 800
5 1 12 mx456 1232
6 3 13 mx322 1000
7 3 13 /bc3 34
Here, a saleperson can have multiple orders and each order can have multiple codes. I want to aggregate and transform amt based on specific combinations of sale, order and code. A code starting with "/bc" needs to be aggregated with main code value("starting with values like 'mx','rx' etc). Note that any code value not staring with /bc is considered type "main". If there are multiple combinations of code values of type "/bc" and "main" type, the aggregation for amt should be done on each combination(for eq rows 1, 2, 3 and 4 has two combinations of type "main" and "/bc". Note that, a specific order would have equal values of code types "/bc" and "main". Once, the aggregation for an order is done, i want the code type "/bc" to be dropped.
If a particular sale and order has no code type "bc", the values of "amt" should be same. For eq, rows 5 and 6 should be unchanged and code, amt values should remain same.
The resulting dataframe df2 should ideally be this:
sale order code amt
0 1 11 mx212 1005
1 1 11 rx321 611
2 1 12 sx234 800
3 1 12 mx456 1232
4 3 13 mx322 1034
amt value in row 1 is "1000+5" and in row 2 is "600+11"{code type "main" is added to respective "/bc". amt values in row 3 and 4 remains same and in row 5, it is "1000+34".
I know this is a lot of information, but i tried to be as coherent as possible. I would request if there are any questions, please comment. I will appreciate it. Any kind of help is always welcomed :)
You could do it like this:
g=df1.groupby(['sale','order',df1.code.str.startswith('/bc')]).cumcount()
df1.groupby(['sale','order',g],as_index=False)['amt','code']\
.agg({'code':'first','amt':'sum'})
Output:
sale order code amt
0 1 11 mx212 1005
1 1 11 rx321 611
2 1 12 sx234 800
3 1 12 mx456 1232
4 3 13 mx322 1034
I break down the steps...key is building a column help to determine the inner group
df1.code=df1.code.replace({'bc':np.nan},regex=True)
df1['New']=df1.code.isnull()
d1=df1.groupby([df1.sale,df1.order,df1.groupby(['sale','order','New']).cumcount()],as_index=False).amt.sum()
pd.concat([d1,df1.dropna().code.reset_index(drop=True)],1)
Out[344]:
sale order amt code
0 1 11 1005 mx212
1 1 11 611 rx321
2 1 12 800 sx234
3 1 12 1232 mx456
4 3 13 1034 mx322

Categories

Resources