Detecting outlier rows by a certain column in a pandas dataframe - python

I have datasets that measure voltage values in a certain column.
I'm looking for an elegant way to extract the rows that deviate from the mean value. There are a couple of groups in "volt_id", and I'd like each group to compute its own mean/std and use them to decide which rows deviate from that group.
For example, I have the original dataset below.
time volt_id value
0 14 A 300.00
1 15 A 310.00
2 15 B 200.00
3 16 B 210.00
4 17 B 300.00
5 14 C 100.00
6 16 C 110.00
7 20 C 200.00
After running the algorithm, I'd keep only rows 4 and 7, which deviate strongly from their groups, as below.
time volt_id value
4 17 B 300.00
7 20 C 200.00
I could do this if there were only a single group, but my code would get messy and lengthy doing this for multiple groups. I'd appreciate a simpler way to do this.
Thanks,

You can compute and filter on the z-score of each group using groupby.
Assuming you want only those rows that are one or more standard deviations away from their group mean:
g = df.groupby('volt_id').value
v = (df.value - g.transform('mean')) / g.transform('std')  # z-score of value within each volt_id group
df[v.abs().ge(1)]
time volt_id value
4 17 B 300.0
7 20 C 200.0

Similar to @COLDSPEED's solution:
In [179]: from scipy.stats import zscore
In [180]: df.loc[df.groupby('volt_id')['value'].transform(zscore) > 1]
Out[180]:
time volt_id value
4 17 B 300.0
7 20 C 200.0
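As in the previous answer, taking the absolute value of the z-score would also flag rows far below their group mean — a small variation on the answer above, not part of it:
df.loc[df.groupby('volt_id')['value'].transform(zscore).abs() > 1]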

One way to do this would be using outliers:
http://www.mathwords.com/o/outlier.htm
You would need to compute the interquartile range (IQR) from the first and third quartiles. You could then filter your data on a simple comparison.
Quartiles are not the only way to determine outliers, however. Here's a discussion comparing standard deviation and quartiles for locating outliers:
https://stats.stackexchange.com/questions/175999/determine-outliers-using-iqr-or-standard-deviation
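For example, a per-group IQR filter might look like the minimal sketch below (the 1.5 multiplier is the conventional choice, not something from the question, and with only two or three rows per group the quartiles are not very meaningful):
g = df.groupby('volt_id')['value']
q1 = g.transform(lambda s: s.quantile(0.25))   # first quartile per group
q3 = g.transform(lambda s: s.quantile(0.75))   # third quartile per group
iqr = q3 - q1
outliers = df[(df['value'] < q1 - 1.5 * iqr) | (df['value'] > q3 + 1.5 * iqr)]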

Related

What is the code to concatenate 2 dataframes in this specific way with python

I'm currently facing a problem that I can't solve. I spent 6 hours trying to find a solution, but in the end nothing worked for me, probably because I'm not using the right things. (I'm using Python, pandas, numpy.)
Imagine that I have 2 dataframes that are the same, except that the second one has 5 fewer days than the other for each cluster, where "day" and "cluster" are column names that are sorted. Each cluster has a different number of days.
Graphically the situation is: https://i.stack.imgur.com/w8wDk.jpg
Now I want to merge/concatenate in such a way that my dataframes do not merge by index. I want the first rows of the second dataframe to match the first rows of the first dataframe. Consequently, this will induce NA values for the last 5 rows of the second dataframe in the merged result.
Graphically the situation will be: https://i.stack.imgur.com/nFWHa.jpg
How can I fix this situation?
Thank you in advance for any kind of help; I've already tried lots of things and I'm really struggling with this problem.
I admit that this is not the prettiest solution, but at least it works. Assuming that the taller and shorter frames are f1 and f2, respectively, the steps are
Create a "fake" frame f with the same height as f1 but with no cluster column.
Gradually fill f at relevant indices taken from f1 with data from f2.
Concatenate the (partially filled) f with f1
To demonstrate this idea, let's say that the two frames are
>>> f1
cluster day A B
0 2 0 1 2
1 2 1 3 4
2 1 2 5 6
3 1 3 7 8
>>> f2
cluster day A B
0 1 5 10 20
1 1 9 30 40
2 2 6 50 60
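(For reference, a quick sketch constructing these two example frames:)
import numpy as np
import pandas as pd

f1 = pd.DataFrame({'cluster': [2, 2, 1, 1],
                   'day': [0, 1, 2, 3],
                   'A': [1, 3, 5, 7],
                   'B': [2, 4, 6, 8]})
f2 = pd.DataFrame({'cluster': [1, 1, 2],
                   'day': [5, 9, 6],
                   'A': [10, 30, 50],
                   'B': [20, 40, 60]})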
The code is as follows (where np is numpy)
f = f1.drop('cluster', axis=1).copy() # the fake frame
f[:] = np.nan
f1g = f1.groupby('cluster') # Allow for a second indexing way using cluster id
f2g = f2.groupby('cluster')
clusters1 = f1g.groups.keys()
clusters2 = f2g.groups.keys()
for cluster in (clusters1 & clusters2):
    idx1 = f1g.get_group(cluster).index  # indices of entries of the current cluster in f1
    idx2 = f2g.get_group(cluster).index  # indices of entries of the current cluster in f2
    m = len(idx2)
    f.loc[idx1[0:m]] = f2.loc[idx2[0:m], ['day', 'A', 'B']].to_numpy()  # fill the first m entries of current cluster in f with data from f2
And the result after concatenating the fake f and the taller f1
>>> pd.concat([f1, f], axis=1)
cluster day A B day A B
0 2 0 1 2 6.0 50.0 60.0
1 2 1 3 4 NaN NaN NaN
2 1 2 5 6 5.0 10.0 20.0
3 1 3 7 8 9.0 30.0 40.0
Final note: You can obtain idx1 and idx2 in the for loop using ways other than groupby, but I think the latter is one of the fastest ways to do this.
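For instance, a plain boolean mask yields the same indices without groupby — a sketch:
idx1 = f1.index[f1['cluster'] == cluster]
idx2 = f2.index[f2['cluster'] == cluster]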

Create averages of column in pandas dataframe based on values in another column [duplicate]

I have a dataframe like this:
cluster org time
1 a 8
1 a 6
2 h 34
1 c 23
2 d 74
3 w 6
I would like to calculate the average of time per org per cluster.
Expected result:
cluster mean(time)
1 15 #=((8 + 6) / 2 + 23) / 2
2 54 #=(74 + 34) / 2
3 6
I do not know how to do it in Pandas, can anybody help?
If you want to first take mean on the combination of ['cluster', 'org'] and then take mean on cluster groups, you can use:
In [59]: (df.groupby(['cluster', 'org'], as_index=False).mean()
.groupby('cluster')['time'].mean())
Out[59]:
cluster
1 15
2 54
3 6
Name: time, dtype: int64
If you want the mean of cluster groups only, then you can use:
In [58]: df.groupby(['cluster']).mean()
Out[58]:
time
cluster
1 12.333333
2 54.000000
3 6.000000
You can also use groupby on ['cluster', 'org'] and then use mean():
In [57]: df.groupby(['cluster', 'org']).mean()
Out[57]:
time
cluster org
1 a 7
c 23
2 d 74
h 34
3 w 6
I would simply do this, which follows your desired logic directly:
df.groupby(['org']).mean().groupby(['cluster']).mean()
Another possible solution is to reshape the dataframe using pivot_table(), then take mean(). aggfunc='mean' (the default, made explicit here) averages time by cluster and org; the trailing mean() then averages those per-org means within each cluster.
df.pivot_table(index='org', columns='cluster', values='time', aggfunc='mean').mean()
Another possibility is to use the level parameter of mean() after the first groupby() to aggregate:
df.groupby(['cluster', 'org']).mean().mean(level='cluster')
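(A compatibility note: the level= argument of mean() has since been deprecated and removed in newer pandas releases, so on recent versions the equivalent would be a second groupby on the index level:)
df.groupby(['cluster', 'org']).mean().groupby(level='cluster').mean()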

How can I choose a random sample of size n from a single pandas dataframe column, with repeated values occurring a maximum of 2 times?

My dataframe looks like this:
Identifier Strain Other columns, etc.
1 A
2 C
3 D
4 B
5 A
6 C
7 C
8 B
9 D
10 A
11 D
12 D
I want to choose n rows at random while maintaining diversity in the strain values. For example, I want a group of 6, so I'd expect my final rows to include at least one of every type of strain with two strains appearing twice.
I've tried converting the Strain column into a numpy array and using the random.choice method, but that didn't seem to work. I've also tried using .sample, but it does not maximize strain diversity.
This is my latest attempt; it outputs a sample of size 7 in order (identifiers 0-7), and the Strains are all the same.
randomsample = df[df.Strain == np.random.choice(df['Strain'].unique())].reset_index(drop=True)
I believe there's something in numpy that does exactly this, but I can't recall what. Here's a fairly fast approach:
Shuffle the data for randomness
enumerate the rows within each group
sort by the enumeration above
slice the top n rows
So in code:
import numpy as np

n = 6
df = df.sample(frac=1)                   # step 1: shuffle the rows
enums = df.groupby('Strain').cumcount()  # step 2: 0, 1, 2, ... within each strain
orders = np.argsort(enums)               # step 3: positions sorted by that enumeration
samples = df.iloc[orders[:n]]            # step 4: slice the top n rows
Output:
Identifier Strain Other columns, etc.
2 3 D NaN
7 8 B NaN
0 1 A NaN
5 6 C NaN
4 5 A NaN
8 9 D NaN
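If you also need to cap repeats at two per strain (as the title asks), one variation on the same idea — a sketch, not part of the original answer — is to drop rows whose within-strain enumeration reaches 2 before slicing:
mask = enums < 2                                                # keep at most two rows per strain
eligible = df[mask]
capped = eligible.iloc[np.argsort(enums[mask].to_numpy())][:n]  # most diverse n rows, max two per strain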

Pandas Frequency of subcategories in a GroupBy

I have a DataFrame as shown below:
First I would like to get the overall frequencies of the CODE values, call it FREQ, then the frequencies of the CODE values within each AXLES group, call it GROUPFREQ.
I was able to calculate the FREQ column using the below code:
pivot = df[['AXLES','CODE']].groupby(['CODE']).agg(['count','mean','min','max'])
pivot['FREQ'] = pivot.AXLES['count'] / pivot.AXLES['count'].sum() * 100
this provides a nice grouped DataFrame as shown below:
However, I could not figure out how to calculate the frequencies within each AXLES group using this grouped DataFrame in the next step.
I tried:
pivot['GROUPFREQ']=pivot['AXLES','mean']['count']/pivot['AXLES','mean']['count'].sum()*100
However, this gives a KeyError: 'count'.
I may be on the wrong path, and what I am trying to achieve may not be doable using groupby. I decided to check with the community after spending a couple of hours of trial and error. I'd be glad if you could let me know what you think.
Thanks!
EDIT:
Reproducible input DataFrame:
,CODE,AXLES
0,0101,5
1,001,4
2,0110111,8
3,010111,7
4,0100,5
5,0101,5
6,0110111,8
7,00111,6
8,00111,6
9,0110111,8
10,0100,5
11,0110011,8
12,01011,6
13,0110111,8
14,0110111,8
15,011011,7
16,011011,7
17,011011,7
18,01011,6
19,01011,6
Desired Output for pivot DataFrame:
CODE,COUNT,AXLES,FREQ,GROUPFREQ
001,1,4,0.05,1.00
00111,2,6,0.1,0.40
0100,2,5,0.1,0.50
0101,2,5,0.1,0.50
01011,3,6,0.15,0.60
010111,1,7,0.05,0.25
0110011,1,8,0.05,0.17
011011,3,7,0.15,0.75
0110111,5,8,0.25,0.83
For the first line of the output:
001 is seen only once in the whole data set (20 records). Thus FREQ = 1/20 = 0.05
When the data is grouped by AXLES, for the AXLES=4 group, 001 is the only record, thus the GROUPFREQ = 1/1 = 1.00. (The same code cannot occur under various AXLE groups, so 001 only needs to be checked for AXLES=4.)
Do you mean:
pivot['FREQ'] = df.groupby('AXLES').CODE.value_counts(normalize=True).reset_index(level=0,drop=True)
Output:
AXLES FREQ
count mean min max
CODE
1 1 4 4 4 1.000000
100 2 5 5 5 0.500000
101 2 5 5 5 0.500000
111 2 6 6 6 0.400000
1011 3 6 6 6 0.600000
10111 1 7 7 7 0.250000
11011 3 7 7 7 0.750000
110011 1 8 8 8 0.166667
110111 5 8 8 8 0.833333
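For completeness, here is a sketch that reproduces both desired columns (FREQ and GROUPFREQ) directly from the reproducible input, assuming CODE is read as a string so the leading zeros survive (the answer's output above shows them as integers):
import io
import pandas as pd

csv = """,CODE,AXLES
0,0101,5
1,001,4
2,0110111,8
3,010111,7
4,0100,5
5,0101,5
6,0110111,8
7,00111,6
8,00111,6
9,0110111,8
10,0100,5
11,0110011,8
12,01011,6
13,0110111,8
14,0110111,8
15,011011,7
16,011011,7
17,011011,7
18,01011,6
19,01011,6
"""
df = pd.read_csv(io.StringIO(csv), index_col=0, dtype={'CODE': str})

out = df.groupby('CODE').agg(COUNT=('AXLES', 'size'), AXLES=('AXLES', 'first'))
out['FREQ'] = out['COUNT'] / len(df)                                   # share of all 20 records
out['GROUPFREQ'] = out['COUNT'] / out['AXLES'].map(df['AXLES'].value_counts())  # share within the code's AXLES group
print(out.round(2))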

Reduce the dataframe rows and lookup

Need help with the following please.
Suppose we have a dataframe:
import pandas as pd

dictionary = {'Category': ['a','a','a','a','a','a','a','a','b','b','b','b','b','b','b'],
              'val1': [11,13,14,17,18,21,22,25,2,8,9,13,15,16,19],
              'val2': [1,0,5,1,4,3,5,9,4,1,5,2,4,0,3]}
df = pd.DataFrame(dictionary)
'val1' is always increasing within the same 'Category' value, i.e. the first and last rows of a category are the min and max values of that category. There are too many rows per category, and I want to make a new dataframe that includes the min and max values of each category and contains equally spaced rows (e.g. 5, including the min and max) from each category.
I think numpy's linspace should be used to create an array of values for each category (e.g. linspace(min, max, 5)), and then something similar to Excel's LOOKUP function should be used to get the closest values of 'val1' from df.
Or maybe there are some other, better ways...
Many thanks for the help.
Is this what you need? Using groupby and reindex:
import numpy as np

l = []
for _, x in df.groupby('Category'):
    x.index = x['val1']
    y = x.reindex(np.linspace(x['val1'].min(), x['val1'].max(), 5), method='nearest')
    l.append(y)
pd.concat(l)
Out[330]:
Category val1 val2
val1
11.00 a 11 1
14.50 a 14 5
18.00 a 18 4
21.50 a 22 5
25.00 a 25 9
2.00 b 2 4
6.25 b 8 1
10.50 b 9 5
14.75 b 15 4
19.00 b 19 3
