Resample Pandas Dataframe based on defined value - python

I'm trying to redistribute the total of the 'Num' column across the rows, capped at 10 per row, and rebuild the dataframe from this aggregation.
import pandas as pd

df = pd.DataFrame({'Num': [2, 12, 4, 25, 5]})
----------------------------------------
   Num
0    2
1   12
2    4
3   25
4    5
How can I rebuild the pandas DataFrame so it looks like this?
   Num
0   10
1   10
2   10
3   10
4    8
Thanks!

Seems like you need:
df = pd.DataFrame({'Num': [2, 12, 4, 25, 5]})
s = df.Num.sum()                                  # total to redistribute (48 here)
df.iloc[:s // 10, 0] = 10                         # fill the complete chunks of 10
df.iloc[-1, 0] = 10 if s % 10 == 0 else s % 10    # put the remainder in the last row
df
Out[369]:
   Num
0   10
1   10
2   10
3   10
4    8
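For comparison, a minimal sketch that builds the same column from the total directly, assuming the output length may be derived from the sum rather than fixed at the original row count:
import pandas as pd

df = pd.DataFrame({'Num': [2, 12, 4, 25, 5]})
s = df['Num'].sum()   # 48

# Full chunks of 10, plus one remainder row when the total
# is not an exact multiple of 10.
chunks = [10] * (s // 10) + ([s % 10] if s % 10 else [])
result = pd.DataFrame({'Num': chunks})   # Num: 10, 10, 10, 10, 8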

Related

How to plot one column's "usage" of another column in pandas

I would like to plot one variable as a constant (total_cap in this case) and layer the maxused_capacity and meanused_capacity values on top. Essentially I want the visual of a stacked bar plot, but without the totals being aggregated together: the height of the bar for each site should always be just the value of Total_Cap, with the other two values layered on.
example df:
SITE  Total_Cap  maxused_Cap  meanused_Cap
   A          4            3             2
   B          8            7             4
   C         12           11             5
   D         16           13            10
I tried this code, but it simply adds the values together when plotting the bars:
x = df4[['SITE', 'maxused_cap', 'Total_Cap']]
y = x.set_index('SITE')
z = y.groupby('SITE')
z.plot.bar(stacked=True).mean()
plt.show()
IIUC, this does what you want by scaling the values relative to Total_Cap, so that the two stacked segments of each bar sum exactly to Total_Cap:
df.set_index('SITE', inplace=True)
df[['maxused_Cap', 'meanused_Cap']].div(
    (df['maxused_Cap'] + df['meanused_Cap']) / df['Total_Cap'],
    axis=0).plot.bar(stacked=True, figsize=(8, 6));
Out: [stacked bar chart; each site's two segments sum to Total_Cap]
Setup the dataframe:
import pandas as pd
import io

t = '''
SITE Total_Cap maxused_Cap meanused_Cap
A 4 3 2
B 8 7 4
C 12 11 5
D 16 13 10'''

df = pd.read_csv(io.StringIO(t), sep=r'\s+')
df
Out:
  SITE  Total_Cap  maxused_Cap  meanused_Cap
0    A          4            3             2
1    B          8            7             4
2    C         12           11             5
3    D         16           13            10
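If you want literal layering rather than rescaling, here is a sketch using matplotlib directly (assuming the df built above, with SITE still a regular column): draw Total_Cap as a full-width background bar and the two usage columns as narrower bars in front of it.
import matplotlib.pyplot as plt
import numpy as np

x = np.arange(len(df))   # one slot per SITE
fig, ax = plt.subplots(figsize=(8, 6))
ax.bar(x, df['Total_Cap'], width=0.8, color='lightgrey', label='Total_Cap')
ax.bar(x - 0.2, df['maxused_Cap'], width=0.35, label='maxused_Cap')
ax.bar(x + 0.2, df['meanused_Cap'], width=0.35, label='meanused_Cap')
ax.set_xticks(x)
ax.set_xticklabels(df['SITE'])
ax.legend()
plt.show()
This keeps each bar's full height at exactly Total_Cap without altering the usage values.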

How to perform a multiple groupby and transform count with a condition in pandas

This is an extension of the question linked here.
I am trying to add an extra column to the groupby:
# Import pandas library
import pandas as pd
import numpy as np

# data
data = [['tom', 10, 2, 'c', 100, 'x'], ['tom', 16, 3, 'a', 100, 'x'], ['tom', 22, 2, 'a', 100, 'x'],
        ['matt', 10, 1, 'c', 100, 'x'], ['matt', 15, 5, 'b', 100, 'x'], ['matt', 14, 1, 'b', 100, 'x']]

# Create the pandas DataFrame
df = pd.DataFrame(data, columns=['Name', 'Attempts', 'Score', 'Category', 'Rating', 'Other'])
df['AttemptsbyRating'] = df.groupby(by=['Rating','Other'])['Attempts'].transform('count')
df
Then I try to add another column counting the rows that have a Score greater than 1 (which should equal 4):
df['scoregreaterthan1'] = df['Score'].gt(1).groupby(by=df[['Rating','Other']]).transform('sum')
But I am getting:
ValueError: Grouper for '<class 'pandas.core.frame.DataFrame'>' not 1-dimensional
Any ideas? Thanks very much!
df['Score'].gt(1) returns a boolean Series rather than a DataFrame. You need a DataFrame first before you can group by the relevant columns.
Use:
df = df[df['Score'].gt(1)]
df['scoregreaterthan1'] = df.groupby(['Rating','Other'])['Score'].transform('count')
df
output:
   Name  Attempts  Score Category  Rating Other  AttemptsbyRating  scoregreaterthan1
0   tom        10      2        c     100     x                 6                  4
1   tom        16      3        a     100     x                 6                  4
2   tom        22      2        a     100     x                 6                  4
4  matt        15      5        b     100     x                 6                  4
If you want to keep the rows whose Score is not greater than one, then instead of this:
df = df[df['Score'].gt(1)]
df['scoregreaterthan1'] = df.groupby(['Rating','Other'])['Score'].transform('count')
do this:
df['scoregreaterthan1'] = df[df['Score'].gt(1)].groupby(['Rating','Other'])['Score'].transform('count')
df['scoregreaterthan1'] = df['scoregreaterthan1'].ffill().astype(int)  # fill the rows dropped by the filter (assumes the first row qualifies, so there is no leading NaN)
output 2:
   Name  Attempts  Score Category  Rating Other  AttemptsbyRating  scoregreaterthan1
0   tom        10      2        c     100     x                 6                  4
1   tom        16      3        a     100     x                 6                  4
2   tom        22      2        a     100     x                 6                  4
3  matt        10      1        c     100     x                 6                  4
4  matt        15      5        b     100     x                 6                  4
5  matt        14      1        b     100     x                 6                  4
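As a sketch of a one-line alternative that avoids both the row filtering and the ffill step: group the boolean Series itself, passing the key columns as a list of Series (which is what sidesteps the "not 1-dimensional" error), and transform with sum.
df['scoregreaterthan1'] = (
    df['Score'].gt(1)                               # boolean: Score > 1
               .groupby([df['Rating'], df['Other']])
               .transform('sum')                    # count of True per group
)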

expand pandas groupby results to initial dataframe

Say I have a dataframe df and group it by a few columns into dfg, taking the median of one of its columns. How can I then take those median values and expand them back out, so that each median sits in a new column of the original df, aligned with the respective conditions? This will create duplicates, but I will be using this column in a subsequent calculation, and having it as a column makes that possible.
Example data:
import pandas as pd
import numpy as np

data = {'idx': [1,1,1,1,1,1,1,1,2,2,2,2,2,2,2,2],
        'condition1': [1,1,2,2,3,3,4,4,1,1,2,2,3,3,4,4],
        'condition2': [1,2,1,2,1,2,1,2,1,2,1,2,1,2,1,2],
        'values': np.random.normal(0, 1, 16)}
df = pd.DataFrame(data)
dfg = df.groupby(['idx', 'condition2'], as_index=False)['values'].median()
example of desired result (note duplicates corresponding to correct conditions):
    idx  condition1  condition2    values   medians
0     1           1           1  0.350310  0.656355
1     1           1           2 -0.291736 -0.024304
2     1           2           1  1.593545  0.656355
3     1           2           2 -1.275154 -0.024304
4     1           3           1  0.075259  0.656355
5     1           3           2  1.054481 -0.024304
6     1           4           1  0.962400  0.656355
7     1           4           2  0.243128 -0.024304
8     2           1           1  1.717391  1.155406
9     2           1           2  0.788847  1.006583
10    2           2           1  1.145891  1.155406
11    2           2           2 -0.492063  1.006583
12    2           3           1 -0.157029  1.155406
13    2           3           2  1.224319  1.006583
14    2           4           1  1.164921  1.155406
15    2           4           2  2.042239  1.006583
I believe you need GroupBy.transform with median for the new column:
df['medians'] = df.groupby(['idx', 'condition2'])['values'].transform('median')
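If you specifically want to reuse the dfg you already computed, a merge back onto the original frame on the grouping keys gives the same column, as a sketch:
dfg = dfg.rename(columns={'values': 'medians'})   # avoid clashing with the 'values' column
df = df.merge(dfg, on=['idx', 'condition2'], how='left')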

Comparing the value of a row in a certain column to the values in other columns

Using Pandas
I'm trying to determine whether the value in a certain column is greater than the values in all the other columns in the same row.
To do this I'm looping through the rows of a dataframe and using the all function to compare the values in the other columns, but it throws the error "string indices must be integers".
It seems like this should work; what's wrong with this approach?
for row in dataframe:
    if all(i < row['col1'] for i in [row['col2'], row['col3'], row['col4'], row['col5']]):
        row['newcol'] = 'value'
Build a mask and pass it to loc:
df.loc[df['col1'] > df.loc[:, 'col2':'col5'].max(axis=1), 'newcol'] = 'newvalue'
(The error itself comes from "for row in dataframe", which iterates over the column names as strings, so row['col1'] tries to index into a string.)
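A minimal, self-contained sketch of the mask approach (frame and column names assumed from the question):
import pandas as pd
import numpy as np

df = pd.DataFrame(np.random.randint(0, 9, (5, 5)),
                  columns=['col1', 'col2', 'col3', 'col4', 'col5'])

# True where col1 strictly exceeds every other column in the row
mask = df['col1'] > df.loc[:, 'col2':'col5'].max(axis=1)
df.loc[mask, 'newcol'] = 'newvalue'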
The main problem, in my opinion, is using a loop for vectorisable logic.
Below is an example of how your logic can be implemented using numpy.where. One fix over the literal translation: the comparison has to exclude column 1 itself, since a value can never be strictly greater than a row maximum that includes it.
import pandas as pd
import numpy as np

df = pd.DataFrame(np.random.randint(0, 9, (5, 10)))

# compare column 1 against the max of all *other* columns
df['new_col'] = np.where(df[1] > df.drop(columns=1).max(axis=1),
                         'col1_is_max',
                         'col1_not_max')
Result:
   0  1  2  3  4  5  6  7  8  9       new_col
0  4  1  3  8  3  2  5  1  1  2  col1_not_max
1  2  7  1  2  5  3  5  1  8  5  col1_not_max
2  1  8  2  5  7  4  0  3  6  3   col1_is_max
3  6  4  2  1  7  2  0  8  3  2  col1_not_max
4  0  1  3  3  0  3  7  4  4  1  col1_not_max

Sum all columns with a wildcard name search using Python Pandas

I have a dataframe in python pandas with several columns taken from a CSV file.
For instance, data =:
Day  P1S1  P1S2  P1S3  P2S1  P2S2  P2S3
  1     1     2     2     3     1     2
  2     2     2     3     5     4     2
And what I need is the sum of all columns whose name starts with P1: something like P1* with a wildcard.
Something like the following, which gives an error:
P1Sum = data["P1*"]
Is there any way to do this with pandas?
I found the answer.
Using data, the dataframe from the question:
P1Channels = data.filter(regex='P1')   # keep columns whose name contains 'P1'
P1Sum = P1Channels.sum(axis=1)         # row-wise sum across those columns
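Note that regex='P1' matches 'P1' anywhere in the name. To require "starts with P1" (so a hypothetical column like 'XP1S1' is excluded), anchor the regex; filter's like parameter is a plain substring alternative:
P1Channels = data.filter(regex='^P1')   # names that start with P1
# or substring matching without regex syntax:
# P1Channels = data.filter(like='P1')
P1Sum = P1Channels.sum(axis=1)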
List comprehensions on columns allow more filters in the if condition:
In [1]: df = pd.DataFrame(np.arange(15).reshape(5, 3), columns=['P1S1', 'P1S2', 'P2S1'])
In [2]: df
Out[2]:
   P1S1  P1S2  P2S1
0     0     1     2
1     3     4     5
2     6     7     8
3     9    10    11
4    12    13    14
In [3]: df.loc[:, [x for x in df.columns if x.startswith('P1')]].sum(axis=1)
Out[3]:
0     1
1     7
2    13
3    19
4    25
dtype: int64
Thanks for the tip jbssm. For anyone else looking for a single grand total, I ended up adding .sum() at the end:
P1Sum = P1Channels.sum(axis=1).sum()
