Need help with the following please.
Suppose we have a dataframe:
dictionary ={'Category':['a','a','a','a','a','a','a','a','b','b','b','b','b','b','b'],
'val1':[11,13,14,17,18,21,22,25,2,8,9,13,15,16,19],
'val2':[1,0,5,1,4,3,5,9,4,1,5,2,4,0,3]}
df=pd.DataFrame(dictionary)
'val1' is always increasing within the same value of 'Category', i.e., the first and last rows of a category hold that category's min and max values. There are too many rows per category, and I want to make a new dataframe that includes the min and max values of each category and contains, e.g., 5 equally spaced rows (including min and max) from each category.
I think numpy's linspace should be used to create an array of values for each category (e.g. linspace(min, max, 5)), then something similar to Excel's LOOKUP function should be used to get the closest values of 'val1' from df.
Or maybe there are some other better ways...
Many thanks for the help.
Is this what you need? Use groupby and reindex with method='nearest':
import numpy as np
import pandas as pd

l = []
for _, x in df.groupby('Category'):
    x.index = x['val1']
    # snap 5 equally spaced targets between min and max onto the nearest existing val1
    y = x.reindex(np.linspace(x['val1'].min(), x['val1'].max(), 5), method='nearest')
    l.append(y)
pd.concat(l)
Out[330]:
Category val1 val2
val1
11.00 a 11 1
14.50 a 14 5
18.00 a 18 4
21.50 a 22 5
25.00 a 25 9
2.00 b 2 4
6.25 b 8 1
10.50 b 9 5
14.75 b 15 4
19.00 b 19 3
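If you prefer the linspace-plus-lookup idea from the question literally, a rough alternative sketch uses pd.merge_asof with direction='nearest'. It reuses df from the question and assumes val1 is sorted within each Category, as stated; the target column name is just for illustration:
import numpy as np
import pandas as pd

parts = []
for cat, grp in df.groupby('Category'):
    grp = grp.sort_values('val1')
    # 5 equally spaced targets between this group's min and max of val1
    targets = pd.DataFrame({'target': np.linspace(grp['val1'].min(), grp['val1'].max(), 5)})
    # merge_asof acts like a "nearest" lookup on the sorted key
    right = grp.assign(target=grp['val1'].astype(float))
    parts.append(pd.merge_asof(targets, right, on='target', direction='nearest'))

pd.concat(parts, ignore_index=True).drop(columns='target')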
Input data:
no Group Value
1 A 5
2 B 10
3 A 7
4 B 20
5 A 8
6 B 30
7 A NaN
8 B NaN
9 A 90
10 B 105
How can I apply a custom Python function (let's call it "custom_fnc") only to the rows whose "Value" field is NaN, where the function would accept a Series of the last two rows within the group?
For example, I would like to calculate "Value" only for the 7th and 8th rows (also, for performance reasons, I don't want to calculate it for the whole dataset), so the function would work with this data:
For the group A, it would need only 3rd and 5th row "Value"
For the group B, it would need only 4th and 6th row "Value"
I was wondering how I can use groupby and rolling, but only for the filtered rows that have NaN as the Value?
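One possible sketch, assuming custom_fnc only needs the last two non-NaN values of the same group that appear before the NaN row (the function body below is just a stand-in that averages them):
import numpy as np
import pandas as pd

def custom_fnc(last_two):
    # stand-in body: replace with the real calculation
    return last_two.mean()

df = pd.DataFrame({'no': range(1, 11),
                   'Group': list('ABABABABAB'),
                   'Value': [5, 10, 7, 20, 8, 30, np.nan, np.nan, 90, 105]})

# loop only over the NaN rows, so nothing is computed for the rest of the dataset
for idx in df.index[df['Value'].isna()]:
    grp = df.at[idx, 'Group']
    # last two non-NaN values of this group that appear before the current row
    prev = df.loc[(df['Group'] == grp) & (df.index < idx), 'Value'].dropna().tail(2)
    df.at[idx, 'Value'] = custom_fnc(prev)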
I have a dataframe like this:
cluster org time
1 a 8
1 a 6
2 h 34
1 c 23
2 d 74
3 w 6
I would like to calculate the average of time per org per cluster.
Expected result:
cluster mean(time)
1 15 #=((8 + 6) / 2 + 23) / 2
2 54 #=(74 + 34) / 2
3 6
I do not know how to do it in Pandas, can anybody help?
If you want to first take mean on the combination of ['cluster', 'org'] and then take mean on cluster groups, you can use:
In [59]: (df.groupby(['cluster', 'org'], as_index=False).mean()
.groupby('cluster')['time'].mean())
Out[59]:
cluster
1 15
2 54
3 6
Name: time, dtype: int64
If you want the mean of cluster groups only, then you can use:
In [58]: df.groupby(['cluster']).mean()
Out[58]:
time
cluster
1 12.333333
2 54.000000
3 6.000000
You can also use groupby on ['cluster', 'org'] and then use mean():
In [57]: df.groupby(['cluster', 'org']).mean()
Out[57]:
time
cluster org
1 a 7
c 23
2 d 74
h 34
3 w 6
I would simply do this, which literally follows your desired logic:
df.groupby(['org']).mean().groupby(['cluster']).mean()
(This works because each org belongs to exactly one cluster, so the cluster column survives the first mean unchanged.)
Another possible solution is to reshape the dataframe using pivot_table(), then take mean(). Note that aggfunc='mean' (the default) averages time by cluster and org; the trailing mean() then averages over orgs within each cluster.
df.pivot_table(index='org', columns='cluster', values='time', aggfunc='mean').mean()
Another possibility is to use the level parameter of mean() after the first groupby() to aggregate:
df.groupby(['cluster', 'org']).mean().mean(level='cluster')
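If you are on a newer pandas where mean() no longer accepts a level argument, the same idea can be written with a second groupby on the index level (equivalent sketch):
df.groupby(['cluster', 'org']).mean().groupby(level='cluster').mean()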
I know there are some questions about this topic (like Pandas: Cumulative sum of one column based on value of another); however, none of them fulfills my requirements.
Let's say I have a dataframe like the one below. I want to compute the cumulative sum of Cost grouped by Month, excluding the current row's value, in order to get the Desired column. By using groupby and cumsum I only obtain the ordinary running total (the CumSum column).
The DDL to generate the dataframe is
df = pd.DataFrame({'Month': [1,1,1,2,2,1,3],
'Cost': [5,8,10,1,3,4,1]})
IIUC you can use groupby.cumsum and then just subtract Cost:
df['cumsum_'] = df.groupby('Month').Cost.cumsum().sub(df.Cost)
print(df)
Month Cost cumsum_
0 1 5 0
1 1 8 5
2 1 10 13
3 2 1 0
4 2 3 1
5 1 4 23
6 3 1 0
You can also get there by shifting each group's running total down one row, so the current value is excluded:
df['Cumsum'] = df.groupby('Month')['Cost'].transform(lambda s: s.cumsum().shift(fill_value=0))
After performing the groupby on two columns (id and category) using the mean aggregation function over a column (col3) I have something like this:
                  col3
id   category     mean
345  A              12
     B               2
     C               3
     D               4
     Total          21
What I would like to do is to add a new column called percentage in which I calculate the percentage of each category over the category Total.
This should be done separately for every id.
The result should be something like this:
                  col3
id   category     mean  percentage
345  A              12        0.57
     B               2        0.09
     C               3        0.14
     D               4        0.19
     Total          21        1
Obviously I want to do that for every id, which is the first column on which I have done the groupby. Any suggestions on how to do that?
Using get_level_values, filter the Total rows out of your df, then divide with div:
# per-id sum of the non-Total rows
s = df[df.index.get_level_values(level=1) != 'Total'].sum(level=0)
df['percentage'] = df.div(s, level=0, axis=1)
df
Out[422]:
mean percentage
id category
345 A 12 0.571429
B 2 0.095238
C 3 0.142857
D 4 0.190476
Total 21 1.000000
That's my suggestion: divide each row's mean by that id's Total row:
df['percentage'] = df['mean'].div(df['mean'].xs('Total', level=1), level=0)
I have datasets which measure voltage values in a certain column.
I'm looking for an elegant way to extract the rows that deviate from the mean value. There are a couple of groups in "volt_id", and I'd like each group to compute its own mean/std and use them to decide which rows deviate from that group.
For example, I have the original dataset below.
time volt_id value
0 14 A 300.00
1 15 A 310.00
2 15 B 200.00
3 16 B 210.00
4 17 B 300.00
5 14 C 100.00
6 16 C 110.00
7 20 C 200.00
After the algorithm runs, I'd only keep rows 4 and 7, which deviate strongly from their groups, as below.
time volt_id value
4 17 B 300.00
7 20 C 200.00
I could do this if there were only a single group, but my code would be messy and lengthy if I did this for multiple groups. I'd appreciate it if there's a simpler way to do this.
thanks,
You can compute the z-score within each group using groupby and filter on it.
Assuming you want only those rows which are 1 or more standard deviations away from the mean:
g = df.groupby('volt_id').value
v = (df.value - g.transform('mean')) / g.transform('std')
df[v.abs().ge(1)]
time volt_id value
4 17 B 300.0
7 20 C 200.0
Similar to #COLDSPEED's solution:
In [179]: from scipy.stats import zscore
In [180]: df.loc[df.groupby('volt_id')['value'].transform(zscore) > 1]
Out[180]:
time volt_id value
4 17 B 300.0
7 20 C 200.0
One way to do this would be using outliers:
http://www.mathwords.com/o/outlier.htm
You would need to define your first and third quartiles and the interquartile range. You could then filter your data with a simple comparison.
Quartiles are not the only way to determine outliers, however. Here's a discussion comparing standard deviation and quartiles for locating outliers:
https://stats.stackexchange.com/questions/175999/determine-outliers-using-iqr-or-standard-deviation
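For completeness, a rough per-group IQR sketch along the lines of that link, reusing df from the question; the 1.5 multiplier is the conventional choice rather than something from the question, and on groups this small it may flag nothing unless you shrink it:
def iqr_outliers(g, k=1.5):
    # keep rows outside [Q1 - k*IQR, Q3 + k*IQR] within one volt_id group
    q1, q3 = g['value'].quantile([0.25, 0.75])
    iqr = q3 - q1
    return g[(g['value'] < q1 - k * iqr) | (g['value'] > q3 + k * iqr)]

df.groupby('volt_id', group_keys=False).apply(iqr_outliers)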