How to compress rows after groupby in pandas

I have performed a groupby on my dataframe:
grouped = data_df.groupby(['Cluster','Visit Number Final'])['Visitor_ID'].count()
I am getting the output below:
Cluster Visit Number Final
0 1 21846
2 1485
3 299
4 95
5 24
6 8
7 3
1 1 33600
2 2283
3 404
4 117
5 34
6 7
2 1 5858
2 311
3 55
4 14
5 6
6 3
7 1
3 1 19699
2 1101
3 214
4 78
5 14
6 8
7 3
4 1 10086
2 344
3 59
4 14
5 3
6 1
Name: Visitor_ID, dtype: int64
Now I want to compress the rows where Visit Number Final > 3, i.e. replace them with a single row per cluster that holds the sum of the counts for Visit Number Final 4 and above. I tried groupby.filter but did not get the expected output.
My final output should look like this:
Cluster Visit Number Final
0 1 21846
2 1485
3 299
>=4 130
1 1 33600
2 2283
3 404
>=4 158
2 1 5858
2 311
3 55
>=4 24
3 1 19699
2 1101
3 214
>=4 103
4 1 10086
2 344
3 59
>=4 18

The easiest way is to replace the 'Visit Number Final' values greater than 3 before you group the dataframe (df here is the question's data_df):
df.loc[df['Visit Number Final'] > 3, 'Visit Number Final'] = '>=4'
df.groupby(['Cluster','Visit Number Final'])['Visitor_ID'].count()
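
A minimal sketch of the same idea on a small made-up data_df (the rows and visitor IDs below are invented for illustration). Instead of overwriting the integer column in place, it builds the bucket label as a separate string Series, which avoids writing a string into an integer column (something recent pandas versions may warn about):

```python
import pandas as pd

# Hypothetical sample data; only the column names come from the question.
data_df = pd.DataFrame({
    "Cluster": [0, 0, 0, 0, 0, 1, 1, 1],
    "Visit Number Final": [1, 2, 4, 5, 6, 1, 4, 7],
    "Visitor_ID": list("abcdefgh"),
})

visits = data_df["Visit Number Final"]
# Keep 1, 2, 3 as string labels and replace everything above 3 with '>=4'.
bucket = visits.astype(str).where(visits <= 3, ">=4")

# Group by the Cluster column and the derived label Series.
result = data_df.groupby(["Cluster", bucket])["Visitor_ID"].count()
print(result)
```

Because the labels are all strings, the second index level holds '1', '2', '3' and '>=4' rather than a mix of integers and strings.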

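Try this (df here is assumed to hold the already-grouped counts, with Cluster and the visit number as index levels and the counts in a column named 'Number Final'):
import numpy as np
visit_val = df.index.get_level_values(1)
# label every visit number above 3 as '>=4'
grp = np.where(visit_val > 3, '>=4', visit_val)
(df.groupby(['Cluster',grp])['Number Final'].sum()
   .reset_index().rename(columns={'level_1':'Visit'}))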
Output:
Cluster Visit Number Final
0 0 1 21846
1 0 2 1485
2 0 3 299
3 0 >=4 130
4 1 1 33600
5 1 2 2283
6 1 3 404
7 1 >=4 158
8 2 1 5858
9 2 2 311
10 2 3 55
11 2 >=4 24
12 3 1 19699
13 3 2 1101
14 3 3 214
15 3 >=4 103
16 4 1 10086
17 4 2 344
18 4 3 59
19 4 >=4 18
Or, to get a dataframe that keeps Cluster and Visit as the index:
(df.groupby(['Cluster',grp])['Number Final'].sum()
.rename_axis(['Cluster','Visit']).to_frame())
Output:
Number Final
Cluster Visit
0 1 21846
2 1485
3 299
>=4 130
1 1 33600
2 2283
3 404
>=4 158
2 1 5858
2 311
3 55
>=4 24
3 1 19699
2 1101
3 214
>=4 103
4 1 10086
2 344
3 59
>=4 18

Related

pandas - Select Last Row of Column Based on Different Column Value

I've got a dataframe like
Season Game Event_Num Home Away Margin
0 2016-17 1 1 0 0 0
1 2016-17 1 2 0 0 0
2 2016-17 1 3 0 2 2
3 2016-17 1 4 0 2 2
4 2016-17 1 5 0 2 2
.. ... ... ... ... ... ...
95 2017-18 5 53 17 10 7
96 2017-18 5 54 17 10 7
97 2017-18 5 55 17 10 7
98 2017-18 5 56 17 10 7
99 2017-18 5 57 17 10 7
And ultimately, I'd like to take the last row of each Game played, so for instance, the last row for Game 1, Game 2, etc. so I can see what the final margin was, but I'd like to do this for every unique season.
For example, if there were 3 games played for 2 unique seasons then the df would look something like:
Season Game Event_Num Home Away Final Margin
0 2016-17 1 1 90 80 10
1 2016-17 2 2 83 88 5
2 2016-17 3 3 67 78 11
3 2017-18 1 4 101 102 1
4 2017-18 2 5 112 132 20
5 2017-18 3 6
Is there a good way to do something like this? TIA.

Try:
df.groupby(['Season','Game']).tail(1)
output
Season Game Event_Num Home Away Margin
4 2016-17 1 5 0 2 2
9 2017-18 5 57 17 10 7
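
A small sketch of what tail(1) does per group, using invented rows (the values below are not from the question): it returns the last row of every (Season, Game) pair and keeps the original row labels.

```python
import pandas as pd

# Hypothetical play-by-play rows: two games in one season, one in the next.
df = pd.DataFrame({
    "Season": ["2016-17"] * 4 + ["2017-18"] * 3,
    "Game": [1, 1, 2, 2, 1, 1, 1],
    "Event_Num": [1, 2, 1, 2, 1, 2, 3],
    "Margin": [0, 2, 1, 5, 0, 3, 7],
})

# Last row of each (Season, Game) group, in original row order.
final_rows = df.groupby(["Season", "Game"]).tail(1)
print(final_rows)
```

This relies on the rows already being in event order within each game; if they are not, sorting by Event_Num first would be needed.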

How to get the number of events in a regular interval of time in a dataframe

Assume I have the dataframe shown below.
It records the number of events that occurred in each second.
Time events_occured
1 2
2 3
3 7
4 4
5 6
6 3
7 86
8 26
9 7
10 26
. .
. .
. .
996 56
997 26
998 97
999 58
1000 34
Now I need the cumulative number of events in every 5-second window.
For example, 22 events occurred in the first 5 seconds, 148 events occurred from second 6 to second 10, and so on.

Like this:
In [647]: df['cumulative'] = df.events_occured.groupby(df.index // 5).cumsum()
In [648]: df
Out[648]:
Time events_occured cumulative
0 1 2 2
1 2 3 5
2 3 7 12
3 4 4 16
4 5 6 22
5 6 3 3
6 7 86 89
7 8 26 115
8 9 7 122
9 10 26 148
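
If only the total per 5-second window is needed (the 22 and 148 from the question) rather than a running sum, the same grouping key can be used with sum(). A sketch on the first ten rows of the question's data:

```python
import pandas as pd

# First ten seconds from the question.
df = pd.DataFrame({
    "Time": range(1, 11),
    "events_occured": [2, 3, 7, 4, 6, 3, 86, 26, 7, 26],
})

# Rows 0-4 fall in window 0, rows 5-9 in window 1.
window = df.index // 5
totals = df["events_occured"].groupby(window).sum()
print(totals)
```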

If some values of Time are missing, grouping by df.index can put rows in the wrong window, so group by df['Time'] instead.
This also works if Time starts at any value N and there are gaps after N:
GROUP_SIZE = 5
df['cumulative'] = df.events_occured\
.groupby(df['Time'].sub(df['Time'].min()) // GROUP_SIZE).cumsum()
print(df)
Time events_occured cumulative
0 1 2 2
1 2 3 5
2 3 7 12
3 4 4 16
4 5 6 22
5 6 3 3
6 7 86 89
7 8 26 115
8 9 7 122
9 10 26 148
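
To illustrate why the Time-based key matters, here is a sketch with seconds 4 and 8 removed (a made-up variation of the question's data). Grouping on the row position would push second 6 into the first window; grouping on Time does not:

```python
import pandas as pd

GROUP_SIZE = 5
# Hypothetical data with seconds 4 and 8 missing.
df = pd.DataFrame({
    "Time": [1, 2, 3, 5, 6, 7, 9, 10],
    "events_occured": [2, 3, 7, 6, 3, 86, 7, 26],
})

# Window number derived from the elapsed time, not from the row position.
window = (df["Time"] - df["Time"].min()) // GROUP_SIZE
df["cumulative"] = df["events_occured"].groupby(window).cumsum()
totals = df.groupby(window)["events_occured"].sum()
print(df)
print(totals)
```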

assign a number id for every 4 rows in pandas dataframe

I have a pandas dataframe like this:
pd.DataFrame({'week': ['2019-w01', '2019-w02','2019-w03','2019-w04',
'2019-w05','2019-w06','2019-w07','2019-w08',
'2019-w9','2019-w10','2019-w11','2019-w12'],
'value': [11,22,33,34,57,88,2,9,10,1,76,14],
'period': [1,1,1,1,2,2,2,2,3,3,3,3]})
week value
0 2019-w1 11
1 2019-w2 22
2 2019-w3 33
3 2019-w4 34
4 2019-w5 57
5 2019-w6 88
6 2019-w7 2
7 2019-w8 9
8 2019-w9 10
9 2019-w10 1
10 2019-w11 76
11 2019-w12 14
What I need is shown below: I would like to assign a period ID to every 4-week interval.
week value period
0 2019-w01 11 1
1 2019-w02 22 1
2 2019-w03 33 1
3 2019-w04 34 1
4 2019-w05 57 2
5 2019-w06 88 2
6 2019-w07 2 2
7 2019-w08 9 2
8 2019-w9 10 3
9 2019-w10 1 3
10 2019-w11 76 3
11 2019-w12 14 3
What is the best way to achieve that? Thanks.

try with:
df['period']=(pd.to_numeric(df['week'].str.split('-').str[-1]
.str.replace('w',''))//4).shift(fill_value=0).add(1)
print(df)
week value period
0 2019-w01 11 1
1 2019-w02 22 1
2 2019-w03 33 1
3 2019-w04 34 1
4 2019-w05 57 2
5 2019-w06 88 2
6 2019-w07 2 2
7 2019-w08 9 2
8 2019-w9 10 3
9 2019-w10 1 3
10 2019-w11 76 3
11 2019-w12 14 3
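
Two other ways to get the same IDs, sketched under the assumption that the week labels always end in "w" followed by the week number. The first derives the period from the week number, the second simply numbers every block of 4 consecutive rows (the column name period_by_position is made up for this example):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "week": ["2019-w01", "2019-w02", "2019-w03", "2019-w04",
             "2019-w05", "2019-w06", "2019-w07", "2019-w08",
             "2019-w9", "2019-w10", "2019-w11", "2019-w12"],
    "value": [11, 22, 33, 34, 57, 88, 2, 9, 10, 1, 76, 14],
})

# Week number taken from the digits after the trailing 'w'.
week_no = df["week"].str.extract(r"w(\d+)$", expand=False).astype(int)
# Weeks 1-4 -> period 1, weeks 5-8 -> period 2, and so on.
df["period"] = (week_no - 1) // 4 + 1

# Alternative that ignores the labels and uses the row position only.
df["period_by_position"] = np.arange(len(df)) // 4 + 1
print(df)
```

The week-number version stays correct if some weeks are missing; the positional version is the literal "every 4 rows" reading of the title.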

Dropping certain values in columns depending on contents of other column

I have a dataframe that looks like this:
Deal Year Financial Data1 Financial Data2 Financial Data3 Quarter
0 1 1991/1/1 122 123 120 1
3 1 1991/1/1 122 123 120 2
6 1 1991/1/1 122 123 120 3
1 2 1992/1/1 85 90 80 4
4 2 1992/1/1 85 90 80 5
7 2 1992/1/1 85 90 80 6
2 3 1993/1/1 85 90 100 1
5 3 1993/1/1 85 90 100 2
8 3 1993/1/1 85 90 100 3
However, I only want Financial Data1 displayed for the first quarter in each deal, Financial Data2 for the second and Financial Data3 for the third, with the whole thing combined into one column again.
The end result should look something like this:
Deal Year Financial Data Quarter
0 1 1991/1/1 122 1
3 1 1991/1/1 123 2
6 1 1991/1/1 120 3
1 2 1992/1/1 85 4
4 2 1992/1/1 90 5
7 2 1992/1/1 80 6
2 3 1993/1/1 85 1
5 3 1993/1/1 90 2
8 3 1993/1/1 100 3

Okie dokie, using np.where() I think this does what you're trying to do:
from io import StringIO
import pandas as pd
import numpy as np
df = pd.read_fwf(StringIO(
"""Deal Year Financial_Data1 Financial_Data2 Financial_Data3 Quarter
1 1991/1/1 122 123 120 1
1 1991/1/1 122 123 120 2
1 1991/1/1 122 123 120 3
2 1992/1/1 85 90 80 4
2 1992/1/1 85 90 80 5
2 1992/1/1 85 90 80 6
3 1993/1/1 85 90 100 1
3 1993/1/1 85 90 100 2
3 1993/1/1 85 90 100 3"""))
df['Financial_Data'] = np.where(
# if 'Quarter'%3==1
df['Quarter']%3==1,
# Then return Financial_Data1
df['Financial_Data1'],
# Else
np.where(
# If 'Quarter'%3==2
df['Quarter']%3==2,
# Then return Financial_Data2
df['Financial_Data2'],
# Else return Financial_Data3
df['Financial_Data3']
)
)
# Drop Old Columns
df = df.drop(['Financial_Data1', 'Financial_Data2', 'Financial_Data3'], axis=1)
print(df)
Output:
Deal Year Quarter Financial_Data
0 1 1991/1/1 1 122
1 1 1991/1/1 2 123
2 1 1991/1/1 3 120
3 2 1992/1/1 4 85
4 2 1992/1/1 5 90
5 2 1992/1/1 6 80
6 3 1993/1/1 1 85
7 3 1993/1/1 2 90
8 3 1993/1/1 3 100
(PS: I wasn't 100% sure how you intended on dealing with Quarter 4-6, in this example I just treat them as 1-3)
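
The nested np.where calls can also be written with np.select, which takes a list of conditions and a matching list of choices. This is a sketch of that alternative, not the answer's original code; it builds the frame directly and, like the answer, treats quarters 4-6 as 1-3:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "Deal": [1, 1, 1, 2, 2, 2, 3, 3, 3],
    "Year": ["1991/1/1"] * 3 + ["1992/1/1"] * 3 + ["1993/1/1"] * 3,
    "Financial_Data1": [122] * 3 + [85] * 6,
    "Financial_Data2": [123] * 3 + [90] * 6,
    "Financial_Data3": [120] * 3 + [80] * 3 + [100] * 3,
    "Quarter": [1, 2, 3, 4, 5, 6, 1, 2, 3],
})

source_cols = ["Financial_Data1", "Financial_Data2", "Financial_Data3"]
# 0, 1 or 2: which of the three source columns applies to this row.
position = (df["Quarter"] - 1) % 3

df["Financial_Data"] = np.select(
    [position == i for i in range(3)],
    [df[col] for col in source_cols],
)
df = df.drop(columns=source_cols)
print(df)
```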

Subtraction using different columns in multiple dictionaries

I have two dicts, one with three columns (A) and another with six columns (B). I would like to use the value in the first column (the index, which runs 1-4 in both) together with the value in the second column (1-2000) to pick the correct element in the third column for the subtraction. The second dict is similar: the first and second columns identify the correct row, but it is the value in the sixth column of that row that is needed for the subtraction.
A                    B
1  1   260   541     1  1   260   280  0.001   521.4
1  1   390  1195     1  1   390   900  0.02    963.3
1  1   102     6     1  1   102     2  0.01      4.8
2  1    65    12     2  1    65     9  0.13     13.1
2  1   515   659     2  1   515   356  0.002   532.2
2  1   354  1200     2  1   354  1087  0.119  1502.3
3  1  1190    53     3  1  1190    46  0.058    12.0
3  1  1985     3     3  1  1985     1  0.006     1.02
3  1   457   192     3  1    25     3  0.001   178.2
4  1   261  2084     4  1   261  1792  0.196   100.7
4  1    12     0     4  1    12     0  0.000    12.6
4  1  1756    30     4  1  1756    28  0.006    23.7
4  1   592   354     4  1   592   291  0.357   251.9
So basically I would like to subtract the last column of B from the last column of A whilst retaining the information held in the first and second columns.
C (desired output)
1 1 260 19.6
1 1 390 231.7
1 1 102 1.2
2 1 65 -1.1
2 1 515 126.8
2 1 354 -302.3
3 1 1190 41.0
3 1 1985 1.98
3 1 457 13.8
4 1 261 1983.3
4 1 12 -12.6
4 1 1756 6.3
4 1 592 102.1
I have been through SO for hours looking for a solution but haven't found one yet, though I'm sure it must be possible.
I also need to create a scatter graph afterwards, in case anyone has suggestions on how to plot the positive values and ignore the negatives.
EDIT:
I have added my code below to make it clearer, I take in a three column csv file and then need to get a count of the frequency of each value of the third column when they have the same value in the first column. B then has further alterations to get out the desired data streams and then the subtraction needs to be made. In a few of the comments it mentioned that column one and two are unnecessary but the value in column three is linked to the value in column one and thus must always remain in the same row together.
import pandas as pd
import numpy as np
def ba(fn, float1, float2):
    ba=pd.read_csv(fn,header=None, skipfooter=6, engine='python')
    ba['col4']=ba.groupby(['col1','col3']).transform(np.size)
    ba['col5']=ba['col4'].apply(lambda x: x/float(float2))
    ba['col6']=ba['col5'].apply(lambda x: x*float1)
    ba=ba.set_index('col1')
    ba = dict(tuple(ba.groupby('col1')))
    return ba

If I understand correctly, A and B are dataframes with matching row order; then:
In [1062]: A.iloc[:, :3].assign(output=A.iloc[:, -1] - B.iloc[:, -1])
Out[1062]:
0 1 2 output
0 1 1 260 19.60
1 1 1 390 231.70
2 1 1 102 1.20
3 2 1 65 -1.10
4 2 1 515 126.80
5 2 1 354 -302.30
6 3 1 1190 41.00
7 3 1 1985 1.98
8 3 1 457 13.80
9 4 1 261 1983.30
10 4 1 12 -12.60
11 4 1 1756 6.30
12 4 1 592 102.10
Details
In [1063]: A
Out[1063]:
0 1 2 3
0 1 1 260 541
1 1 1 390 1195
2 1 1 102 6
3 2 1 65 12
4 2 1 515 659
5 2 1 354 1200
6 3 1 1190 53
7 3 1 1985 3
8 3 1 457 192
9 4 1 261 2084
10 4 1 12 0
11 4 1 1756 30
12 4 1 592 354
In [1064]: B
Out[1064]:
0 1 2 3 4 5
0 1 1 260 280 0.001 521.40
1 1 1 390 900 0.020 963.30
2 1 1 102 2 0.010 4.80
3 2 1 65 9 0.130 13.10
4 2 1 515 356 0.002 532.20
5 2 1 354 1087 0.119 1502.30
6 3 1 1190 46 0.058 12.00
7 3 1 1985 1 0.006 1.02
8 3 1 25 3 0.001 178.20
9 4 1 261 1792 0.196 100.70
10 4 1 12 0 0.000 12.60
11 4 1 1756 28 0.006 23.70
12 4 1 592 291 0.357 251.90
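
A self-contained sketch of the same subtraction on four of the rows above, followed by one possible way to keep only the positive differences for the scatter graph mentioned in the question. The subtraction is positional, so it assumes A and B list their rows in the same order; the names C and positive are illustrative:

```python
import pandas as pd

# Four rows taken from the A and B tables above.
A = pd.DataFrame([
    [1, 1, 260, 541],
    [1, 1, 390, 1195],
    [2, 1, 65, 12],
    [4, 1, 12, 0],
])
B = pd.DataFrame([
    [1, 1, 260, 280, 0.001, 521.4],
    [1, 1, 390, 900, 0.020, 963.3],
    [2, 1, 65, 9, 0.130, 13.1],
    [4, 1, 12, 0, 0.000, 12.6],
])

# Keep A's first three columns and add last(A) - last(B), row by row.
C = A.iloc[:, :3].assign(output=A.iloc[:, -1] - B.iloc[:, -1])

# Rows with a positive difference only, e.g. as input for a scatter plot.
positive = C[C["output"] > 0]
print(C)
print(positive)
```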
