Calculate the sum of the first n rows for each group - python

What I want to do is group by ColA, take the sum of the first two rows of ColB, and then assign that value as a new column. Example below:
DF:
ColA ColB
AA 2
AA 1
AA 5
AA 3
BB 9
BB 3
BB 2
BB 12
CC 0
CC 10
CC 5
CC 3
Desired DF:
ColA ColB NewCol
AA 2 3
AA 1 3
AA 5 3
AA 3 3
BB 9 12
BB 3 12
BB 2 12
BB 12 12
CC 0 10
CC 10 10
CC 5 10
CC 3 10
For AA, it looks at ColB, takes the sum of the first two rows, and assigns that summed value to NewCol. I've tried this by looping through the unique ColA values, creating a subset dataframe of the first two rows of each group, summing, and populating a dictionary with the values, then mapping the dictionary back - but my dataframe is VERY big and it takes forever. Any ideas?
Thank you!

You can use transform with a lambda function to get a new value for each row. In the lambda, head(2) takes the first 2 rows of each group and sum() adds them:
df.groupby('ColA')['ColB'].transform(lambda x: x.head(2).sum())
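For reference, a minimal runnable sketch of the same idea, assuming the sample data above and that the result is stored as NewCol (the column name from the question):

import pandas as pd

df = pd.DataFrame({
    'ColA': ['AA'] * 4 + ['BB'] * 4 + ['CC'] * 4,
    'ColB': [2, 1, 5, 3, 9, 3, 2, 12, 0, 10, 5, 3],
})

# for each ColA group, sum the first two ColB values and broadcast that sum to every row
df['NewCol'] = df.groupby('ColA')['ColB'].transform(lambda x: x.head(2).sum())
print(df)

transform keeps the original row count, so the per-group sum (3, 12, 10) is repeated on every row of its group, matching the desired output.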

Related

Tricky apply difference from two dataframes in a specific column using Python

I would like to compare the sums of an original df and a rounded df.
If there is a delta between the sums, apply that delta, whether by subtraction or addition, to the last quarter.
The first sum difference, for AA, is 4 (12 - 8 = 4).
The second sum difference, for BB, is 2 (14 - 12 = 2).
Data
original_df
id q121 q221 q321 q421 sum
AA 1 0.5 0.5 6.1 8
BB 1 0.5 6.5 3.1 12
rounded_df
id q121 q221 q321 q421 sum
AA 2 2 2 6 12
BB 2 2 6 4 14
Desired
We've subtracted 4 from 12 to obtain 8 for AA.
We've subtracted 2 from 14 to obtain 12 for BB
(when comparing original to rounded)
Now the new final_df matches the sum of the original_df
final_df
id q121 q221 q321 q421 sum delta
AA 2 2 2 2 8 4
BB 2 2 6 2 12 2
Doing
Compare sum and create delta
final_df['delta'] = np.where(original_df['sum'] == rounded_df['sum'],
                             0, original_df['sum'] - rounded_df['sum'])
Apply delta to last quarter of the year
I am still not sure how to complete the second half of the problem. I am still researching; any suggestion is appreciated.
Using sub, filter, update and iloc:
# create the delta: difference between the sums of the two DataFrames
df2['delta'] = df2['sum'].sub(df['sum'])
# subtract the delta from the last quarter, obtained using filter
# create a placeholder df3
df3 = df2.filter(like='q').iloc[:, -1:].sub(df2.iloc[:, -1:].values)
# filter(like='q')        : keep only the columns that have 'q' in their name
# .iloc[:, -1:]           : of those, select the last column (the last quarter)
# df2.iloc[:, -1:].values : the values of the last column of df2, i.e. delta
# the subtraction results in df3
# update df2 based on df3
df2.update(df3)
df2
# update() overwrites the matching column of df2 with the values from df3
id q121 q221 q321 q421 sum delta
0 AA 2 2 2 2 12 4
1 BB 2 2 6 2 14 2
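For completeness, a self-contained sketch of the same idea using the names from the question (original_df / rounded_df, with q421 as the last quarter); unlike the answer above, this also adjusts the sum column so it matches the original totals:

import pandas as pd

original_df = pd.DataFrame({'id': ['AA', 'BB'],
                            'q121': [1, 1], 'q221': [0.5, 0.5],
                            'q321': [0.5, 6.5], 'q421': [6.1, 3.1],
                            'sum': [8, 12]})
rounded_df = pd.DataFrame({'id': ['AA', 'BB'],
                           'q121': [2, 2], 'q221': [2, 2],
                           'q321': [2, 6], 'q421': [6, 4],
                           'sum': [12, 14]})

final_df = rounded_df.copy()
# delta between the rounded and original totals
final_df['delta'] = rounded_df['sum'] - original_df['sum']
# push the delta into the last quarter so the total matches the original again
final_df['q421'] = final_df['q421'] - final_df['delta']
final_df['sum'] = final_df['sum'] - final_df['delta']
print(final_df)

With the sample data this reproduces the final_df shown in the question (q421 of 2 for both rows, sums of 8 and 12, deltas of 4 and 2).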

Pivot select tables in dataframe to make values column headers in Python

I have a dataframe, df, where I would like to transform and pivot select values.
I wish to groupby id and date, sum the 'pwr' values and then count the type values.
df
df values that will be column headers: 'hi', 'hey'
id date type pwr de_id de_date de_type de_pwr base base_pos
aa q1 hey 10 aa q1 hey 5 200 40
aa q1 hi 5 200 40
aa q1 hey 5 200 40
aa q2 hey 2 aa q2 hey 3 200 40
aa q2 hey 2 aa q2 hey 3 200 40
bb q1 0 bb q1 hi 6 500 10
bb q1 0 bb q1 hi 6 500 10
Desired
id date hey hi total sum hey hi totald desum base base_pos
aa q1 2 1 3 20 1 0 1 5 200 40
aa q2 2 0 2 4 2 0 2 6 200 40
bb q1 0 0 0 0 0 2 2 12 500 10
Doing
sum1 = df.groupby(['id','date']).agg({'pwr': 'sum', 'type': 'count', 'de_pwr': 'sum', 'de_type': 'count'})
pd.pivot_table(df, values = '' , columns = 'type')
Any suggestion will be helpful.
So, this is definitely not a 'clean' way to go around it, but since you have 2 separate totals summing along columns, I don't know how much cleaner it could get (and the output seems accurate).
You don't mention what aggregation you use to get base and base_pos values, so I went with mean (might need to change it).
type_col = pd.crosstab(index = [df['id'], df['date']], columns = df['type'])
type_col['total'] = type_col.sum(axis = 1)
pwr_sum = df.groupby(['id','date'])['pwr'].sum()
de_type_col = pd.crosstab(index = [df['id'], df['date']], columns = df['de_type'])
de_type_col['total_de'] = de_type_col.sum(axis = 1)
pwr_de_sum = df.groupby(['id','date'])['de_pwr'].sum()
base_and_pos = df.groupby(['id','date'])[['base','base_pos']].mean()
out = pd.concat([type_col, pwr_sum, de_type_col, pwr_de_sum, base_and_pos], axis = 1).fillna(0).astype('int')
Essentially use crosstab to get value counts and sum them along columns. The index of resulting DataFrame is the same as groupby(['id','date']), so you can then concatenate results of groupby without issue. Repeat the same process for de columns, apply groupby with your choice of aggregation to base and base_pos columns, and concatenate all results along axis = 1. Obviously, you can group some operations together (such as pwr sum, de_pwr sum and base/base_pos aggregation), but you'll need to reorder your columns after that to get the desired order.
Output:
id date hey hi total pwr hey hi total_de de_pwr base base_pos
aa q1 2 1 3 20 1 0 1 5 200 40
aa q2 2 0 2 4 2 0 2 6 200 40
bb q1 0 0 0 0 0 2 2 12 500 10
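One optional tweak, not part of the answer above: the two crosstabs produce identically named 'hey'/'hi' columns, so if duplicate column names are a problem you could suffix the de_type counts before concatenating, for example:

# rename the de_type value counts so they don't collide with the type counts
de_type_col = pd.crosstab(index=[df['id'], df['date']], columns=df['de_type']).add_suffix('_de')
de_type_col['total_de'] = de_type_col.sum(axis=1)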

How can I get the sum of several columns grouping by other columns?

What I'm trying to do is replicate this SQL code to Python:
select column_1, column_2, column_3,
sum(column_4) as sum_column_4, sum(column_5) as sum_column_5
from df
group by 1,2,3;
In other words, I need to make this data frame:
column_1 column_2 column_3 column_4 column_5
AA BB CC 5 3
AA BB CC 5 0
AA BB CC 7 3
AA DD EE 5 2
AA DD EE 7 1
DD EE FF 2 8
DD EE FF 1 0
Look like this:
column_1 column_2 column_3 sum_column_4 sum_column_5
AA BB CC 17 6
AA DD EE 12 3
DD EE FF 3 8
Also, I'm trying to make this as simple as possible, because I actually have a lot of columns. And I need to have a new Pandas data frame as output.
So this is what I've tried:
df.groupby(list(df.columns)[0:3]).sum()
It's almost there, the problem is that the output gets weird, something like:
column_1 column_2 column_3 sum_column_4 sum_column_5
AA BB CC 17 6
DD EE 12 3
DD EE FF 3 8
I've tried different things that I've seen in other posts, like Pandas DataFrame Groupby two columns and get counts and Python Pandas group by multiple columns, mean of another - no group by object, but they didn't work. Any help would be appreciated.
As @Quang mentioned in the comments, you need to reset the index:
df.groupby(list(df.columns)[0:3]).sum().reset_index()
When you group by multiple columns at once, you create a hierarchical MultiIndex, which is why the repeated column_1 value AA appears only once under its group in the output.
output:
column_1 column_2 column_3 column_4 column_5
0 AA BB CC 17 6
1 AA DD EE 12 3
2 DD EE FF 3 8
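Equivalently, you can ask groupby not to build the MultiIndex in the first place; a small sketch (the rename is optional and just mirrors the SQL aliases):

out = df.groupby(list(df.columns)[:3], as_index=False).sum()
out = out.rename(columns={'column_4': 'sum_column_4', 'column_5': 'sum_column_5'})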

Counting mode occurrences for all columns in a dataframe

I have a dataframe that looks like below.
dataframe1 =
In AA BB CC
0 10 1 0
1 11 2 3
2 10 6 0
3 9 1 0
4 10 3 1
5 1 2 0
Now I want to create a dataframe that gives me the count of mode occurrences for each column. For column AA the count is 3 (mode 10) and for column CC the count is 4 (mode 0), but BB has two modes, 1 and 2, so for BB I want the sum of the counts of both modes: 2 + 2 = 4.
Therefore the final dataframe that I want looks like below.
Columns Counts
AA 3
BB 4
CC 4
How to do it?
Another slightly more scalable solution using list comprehension:
pd.concat([df.eq(x) for _, x in df.mode().iterrows()]).sum()
[out]
AA 3
BB 4
CC 4
dtype: int64
You can compare each column with its modes and count the matches with sum:
df = pd.DataFrame({'Columns': df.columns,
                   'Val': [df[x].isin(df[x].mode()).sum() for x in df]})
print (df)
Columns Val
0 AA 3
1 BB 4
2 CC 4
First we get the modes of the columns with DataFrame.mode.
Then we compare each column to its modes with Series.isin and sum the matches.
modes = df.iloc[:, 1:].mode()
data = {col: df[col].isin(modes[col]).sum() for col in df.iloc[:, 1:].columns}
df = pd.DataFrame.from_dict(data, orient='index', columns=['Counts'])
Counts
AA 3
BB 4
CC 4
Using the pyjanitor module for its groupby_agg function (a groupby transform that returns a dataframe):
(df.melt(id_vars='In')
   .groupby('variable')
   .agg(numbers=('value', 'value_counts'))
   .groupby_agg(by='variable',
                # subtract the max count in each group from every count in that group
                agg=lambda x: x - x.max(),
                agg_column_name='numbers',
                new_column_name='test')
   .query('test==0')
   .groupby('variable')
   .agg(count=('numbers', 'sum'))
)
count
variable
AA 3
BB 4
CC 4
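For reference, a minimal reproducible sketch of the isin-based approach with the sample data, assuming 'In' is a regular column as shown in the question:

import pandas as pd

df = pd.DataFrame({'In': range(6),
                   'AA': [10, 11, 10, 9, 10, 1],
                   'BB': [1, 2, 6, 1, 3, 2],
                   'CC': [0, 3, 0, 0, 1, 0]})

value_cols = df.columns[1:]                     # skip the 'In' column
modes = df[value_cols].mode()                   # one column of modes per original column
counts = pd.DataFrame({'Columns': value_cols,
                       'Counts': [df[c].isin(modes[c].dropna()).sum() for c in value_cols]})
print(counts)

This counts every row whose value equals any of the column's modes, so BB correctly gets 2 + 2 = 4.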

Split dataframe into smaller dataframes based on range of values

I have the following dataframe:
x text
1 500 aa
2 550 bb
3 700 cc
4 750 dd
My goal is to split this df if the x-values are more than 100 points apart.
Is there a pandas function that allows you to make a split based on range of values?
Here is my desired output:
df_1:
x text
0 500 aa
1 550 bb
df_2:
x text
0 700 cc
1 750 dd
I believe you need to convert the groupby object to a tuple of (key, group) pairs and then to a dictionary, using a helper Series:
d = dict(tuple(df.groupby(df['x'].diff().gt(100).cumsum())))
print (d)
{0: x text
1 500 aa
2 550 bb, 1: x text
3 700 cc
4 750 dd}
Detail:
First get the differences with Series.diff, compare against 100 with Series.gt, and create consecutive group labels with Series.cumsum:
print (df['x'].diff().gt(100).cumsum())
1 0
2 0
3 1
4 1
Name: x, dtype: int32
Make a new column with shift(1) and then split wherever the difference between x and the shifted column exceeds 100, as in the sketch below.
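A minimal sketch of that shift-based suggestion (equivalent to the diff approach above):

import pandas as pd

df = pd.DataFrame({'x': [500, 550, 700, 750],
                   'text': ['aa', 'bb', 'cc', 'dd']})

# flag rows where the gap to the previous x exceeds 100, then label consecutive groups
group_id = (df['x'] - df['x'].shift(1) > 100).cumsum()
df_1, df_2 = [g.reset_index(drop=True) for _, g in df.groupby(group_id)]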
