Pandas groupby over consecutive duplicates - python

Given a table,
Id  Value
 1      1
 2      2
 2      3
 3      4
 4      5
 4      6
 2      8
 2      3
 1      1
Instead of a simple groupby('Id').agg({'Value':'sum'}), which would aggregate over all instances and yield a table with only four rows, I want the result to aggregate only over neighbouring instances, thereby maintaining the order in which the table was created.
The expected output is following,
Id  Value
 1      1
 2      5
 3      4
 4     11
 2     11
 1      1
If not possible with pandas groupby, any other kind of trick would also be greatly appreciated.
Note: If the above example is not helpful, basically what I want is to somehow compress the table by aggregating over 'Value'. The aggregation should be done only over duplicate 'Id's which occur exactly one after the other.

Unfortunately, the answers from eshirvana and wwnde don't seem to work for a long dataset. Inspired by wwnde's answer, I found a workaround,
# create a series labelling each run of consecutive identical Ids
new = []
seen = None
i = -1
for item in df.Id:
    if item != seen:
        i += 1
        seen = item
    new.append(i)
df['temp'] = new
Now, we groupby over the 'temp' column.
df.groupby('temp').agg({'Id':max, 'Value':sum}).reset_index(drop=True)
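The run-labelling loop above can also be vectorised; a minimal sketch (recreating the df from the question) that builds the same group labels with shift/cumsum:

```python
import pandas as pd

df = pd.DataFrame({'Id':    [1, 2, 2, 3, 4, 4, 2, 2, 1],
                   'Value': [1, 2, 3, 4, 5, 6, 8, 3, 1]})

# a new group starts whenever Id differs from the row above it
group = df['Id'].ne(df['Id'].shift()).cumsum()

out = (df.groupby(group)
         .agg({'Id': 'max', 'Value': 'sum'})
         .reset_index(drop=True))
print(out)  # Id 1,2,3,4,2,1 with Value 1,5,4,11,11,1
```

This avoids the Python-level loop entirely, which matters for the long datasets mentioned above.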

Pandas DataFrame MultiIndex Pivot - Remove Empty Headers and Axis Rows

This is closely related to the question I asked earlier here: Python Pandas Dataframe Pivot Table Column and Values Order. Thanks again for the help. Very much appreciated.
I'm trying to automate a report that will be distributed via email to a large audience so it needs to look "pretty" :)
I'm having trouble resetting/removing the Indexes and/or Axis post-Pivots to enable me to use the .style CSS functions (i.e. creating a Styler Object out of the df) to make the table look nice.
I have a DataFrame where two of the principal fields (in my example here they are "Name" and "Bucket") will be variable. The desired display order will also change (so it can't be hard-coded) but it can be derived earlier in the application (e.g. "Name_Rank" and "Bucket_Rank") into Integer "Sorting Values" which can be easily sorted (and theoretically dropped later).
I can drop the column Sorting Value but not the Row/Header/Axis(?). Additionally, no matter what I try I just can't seem to get rid of the blank row between the headers and the DataTable.
I (think) I need to set the Index = Bucket and Headers = "Name" and "TDY/Change" to use the .style style object functionality properly.
import pandas as pd
import numpy as np

data = [
    ['AAA', 2, 'X', 3, 5, 1],
    ['AAA', 2, 'Y', 1, 10, 2],
    ['AAA', 2, 'Z', 2, 15, 3],
    ['BBB', 3, 'X', 3, 15, 3],
    ['BBB', 3, 'Y', 1, 10, 2],
    ['BBB', 3, 'Z', 2, 5, 1],
    ['CCC', 1, 'X', 3, 10, 2],
    ['CCC', 1, 'Y', 1, 15, 3],
    ['CCC', 1, 'Z', 2, 5, 1],
]
df = pd.DataFrame(data, columns=['Name', 'Name_Rank', 'Bucket', 'Bucket_Rank', 'Price', 'Change'])
display(df)
  Name  Name_Rank Bucket  Bucket_Rank  Price  Change
0  AAA          2      X            3      5       1
1  AAA          2      Y            1     10       2
2  AAA          2      Z            2     15       3
3  BBB          3      X            3     15       3
4  BBB          3      Y            1     10       2
5  BBB          3      Z            2      5       1
6  CCC          1      X            3     10       2
7  CCC          1      Y            1     15       3
8  CCC          1      Z            2      5       1
Based on the prior question/answer I can pretty much get the table into the right format:
df2 = (pd.pivot_table(df, values=['Price','Change'],
                      index=['Bucket_Rank','Bucket'],
                      columns=['Name_Rank','Name'], aggfunc=np.mean)
       .swaplevel(1,0,axis=1)
       .sort_index(level=0,axis=1)
       .reindex(['Price','Change'],level=1,axis=1)
       .swaplevel(2,1,axis=1)
       .rename_axis(columns=[None,None,None])
      ).reset_index().drop('Bucket_Rank',axis=1).set_index('Bucket').rename_axis(columns=[None,None,None])
which looks like this:
           1              2              3
         CCC            AAA            BBB
       Price Change   Price Change   Price Change
Bucket
Y         15      3      10      2      10      2
Z          5      1      15      3       5      1
X         10      2       5      1      15      3
Ok, so...
A) How do I get rid of the Row/Header/Axis(?) that used to be "Name_Rank" (i.e. the integer "Sorting Values" 1, 2, 3)? I figured out a hack where the df is exported to XLS and re-imported with header=(1,2), but that can't be the best way to accomplish the objective.
B) How do I get rid of the blank row above the data in the table? From what I've read online it seems like you should "rename_axis=[None]" but this doesn't seem to work no matter which order I try.
C) Is there a way to set the Header(s) such that the both what used to be "Name" and "Price/Change" rows are Headers so that the .style functionality can be employed to format them separate from the data in the table below?
Thanks a lot for whatever suggestions anyone might have. I'm totally stuck!
Cheers,
Devon
In pandas 1.4.0 the options for A and B are directly available using the Styler.hide method.

pandas how to multiply all elements of same column in python

I know it is a simple question but I couldn't find the answer anywhere: I need to show the multiplication of all values of a single column in Python.
Here's the dataframe:
   VALUE
0      2
1      3
2      1
3      3
4      1
The output should give me 2 * 3 * 1 * 3 * 1 = 18.
Try prod
df.VALUE.prod()
Out[345]: 18
To add to the previous answer you can use df.product(axis=0) as well.
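Both suggestions give the same result on the example column; a quick sketch:

```python
import pandas as pd

df = pd.DataFrame({'VALUE': [2, 3, 1, 3, 1]})

col_prod = df.VALUE.prod()       # Series.prod multiplies all values -> 18
frame_prod = df.product(axis=0)  # DataFrame.product: per-column products as a Series
print(col_prod, frame_prod['VALUE'])
```

df.product(axis=0) is handy when you want the product of every column at once rather than a single one.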

Get values of two different columns based on a condition in a third column

I have a certain condition (Incident = yes) and I want to know the values in two columns fulfilling this condition. I have a very big data frame (many rows and many columns) and I am looking for a "screening" function.
To illustrate, here is an example with the df (which has many more columns than shown):
Repetition Step Incident Test1 Test2
1 1 no 10 20
1 1 no 9 11
1 2 yes 9 19
1 2 yes 11 20
1 2 yes 12 22
1 3 yes 9 18
1 3 yes 8 18
What I would like to get as an answer is
Repetition Step
1 2
1 3
If I only wanted to know the Step, I would use the following command:
df[df.Incident == 'yes'].Step.unique()
Is there a similar command to get the values of two columns for a specific condition?
Thanks for the help! :-)
You could use the query method for the condition, select the columns of interest, and finally remove duplicate values:
df.query('Incident=="yes"').filter(['Repetition','Step']).drop_duplicates()
OR
you could use pandas' loc method: set the rows with the condition, set the columns part to the columns you are interested in, then drop the duplicates.
df.loc[df.Incident=="yes",['Repetition','Step']].drop_duplicates()
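Recreating the sample data, either variant returns the two expected rows; a quick end-to-end check of the loc form:

```python
import pandas as pd

df = pd.DataFrame({
    'Repetition': [1, 1, 1, 1, 1, 1, 1],
    'Step':       [1, 1, 2, 2, 2, 3, 3],
    'Incident':   ['no', 'no', 'yes', 'yes', 'yes', 'yes', 'yes'],
    'Test1':      [10, 9, 9, 11, 12, 9, 8],
    'Test2':      [20, 11, 19, 20, 22, 18, 18],
})

# keep only the rows matching the condition, project two columns, dedupe
out = df.loc[df.Incident == 'yes', ['Repetition', 'Step']].drop_duplicates()
print(out)  # rows (1, 2) and (1, 3)
```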

Addition with nested columns in python

I have a pandas groupby object that I made from a larger dataframe, in which amounts are grouped under a person ID variable as well as whether it was an ingoing or outgoing transaction. Here's an example:
ID In_Out Amount
1 In 5
1 Out 8
2 In 4
2 Out 2
3 In 3
3 Out 9
4 Out 8
(Sorry, I don't know how to put actual sample data in.) Note that some folks can have one or the other (e.g., maybe they have some going out but nothing coming in).
All I want to do is get the difference in the amounts, collapsed under the person. So the ideal output would be, perhaps, a dictionary or other dataframe containing the difference in amounts under each person, like this:
ID Difference
1 -3
2 2
3 -6
4 -8
I have tried a handful of different ways to do this but am not sure how to work with these nested lists in python.
Thanks!
We could select the rows that are Out, convert their amounts to negative, and then use sum().
import pandas as pd
from io import StringIO

s = '''\
ID In_Out Amount
1 In 5
1 Out 8
2 In 4
2 Out 2
3 In 3
3 Out 9
4 Out 8'''
# Recreate the dataframe
df = pd.read_csv(StringIO(s), sep=r'\s+')
# Select rows where In_Out == 'Out' and multiply their Amount by -1
df.loc[df['In_Out'] == 'Out', 'Amount'] *= -1
# Sum the signed amounts per ID and convert to dict
d = df.groupby('ID')['Amount'].sum().to_dict()
print(d)
Returns:
{1: -3, 2: 2, 3: -6, 4: -8}
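A variation that avoids mutating the Amount column is to map In/Out to a sign first; a sketch on the same data:

```python
import pandas as pd

df = pd.DataFrame({
    'ID':     [1, 1, 2, 2, 3, 3, 4],
    'In_Out': ['In', 'Out', 'In', 'Out', 'In', 'Out', 'Out'],
    'Amount': [5, 8, 4, 2, 3, 9, 8],
})

# map each direction to a sign, then sum the signed amounts per person
signed = df['Amount'] * df['In_Out'].map({'In': 1, 'Out': -1})
d = signed.groupby(df['ID']).sum().to_dict()
print(d)  # {1: -3, 2: 2, 3: -6, 4: -8}
```

Keeping df untouched makes it easier to rerun the computation or derive other summaries from the same frame later.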

Python Pandas - filtering df by the number of unique values within a group

Here is an example of data I'm working on. (as a pandas df)
index inv Rev_stream Bill_type Net_rev
1 1 A Original -24.77
2 1 B Original -24.77
3 2 A Original -409.33
4 2 B Original -409.33
5 2 C Original -409.33
6 2 D Original -409.33
7 3 A Original -843.11
8 3 A Rebill 279.5
9 3 B Original -843.11
10 4 A Rebill 279.5
11 4 B Original -843.11
12 5 B Rebill 279.5
How could I filter this df so as to keep only the lines where the inv/Rev_stream combo has both the Original and Rebill kinds of Net_rev? In the example above that would be only the lines with index 7 and 8.
Is there an easy way to do it, without iterating over the whole dataframe and building dictionaries of invoice+RevStream : Bill_type?
What I'm looking for is some kind of
df = df[df[['inv','Rev_stream']]['Bill_type'].unique().len() == 2]
Unfortunately the code above doesn't work.
Thanks in advance.
You can group your data by inv and Rev_stream columns and then check for each group if both Original and Rebill are in the Bill_type values and filter based on the condition:
(df.groupby(['inv', 'Rev_stream'])
.filter(lambda g: 'Original' in g.Bill_type.values and 'Rebill' in g.Bill_type.values))
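Recreating the sample data, the same filter (written here with an equivalent set check) keeps only the inv 3 / Rev_stream A rows:

```python
import pandas as pd

df = pd.DataFrame({
    'inv':        [1, 1, 2, 2, 2, 2, 3, 3, 3, 4, 4, 5],
    'Rev_stream': ['A', 'B', 'A', 'B', 'C', 'D', 'A', 'A', 'B', 'A', 'B', 'B'],
    'Bill_type':  ['Original'] * 7
                  + ['Rebill', 'Original', 'Rebill', 'Original', 'Rebill'],
    'Net_rev':    [-24.77, -24.77, -409.33, -409.33, -409.33, -409.33,
                   -843.11, 279.5, -843.11, 279.5, -843.11, 279.5],
})

# keep groups whose Bill_type values include both Original and Rebill
out = (df.groupby(['inv', 'Rev_stream'])
         .filter(lambda g: {'Original', 'Rebill'} <= set(g.Bill_type)))
print(out)  # only the inv 3 / Rev_stream 'A' pair survives
```

GroupBy.filter returns the surviving rows with their original index, so the result can be joined straight back against other per-row data.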
