Pandas sum values in a group within a group - python

First of all, sorry for the bad title. I will illustrate better here. I have a dataframe such as this:
level 1  level 2  qty 2  level 3  qty 3  level 4  qty 4
1980     2302     1.2    nan      nan    nan      nan
1980     7117     2.4    10025    15     2343     11
1980     7117     2.4    1221     1.3    nan      nan
1870     2333     22     nan      nan    nan      nan
1870     7117     2.1    10025    12     nan      nan
1870     7117     2.1    5445     11     nan      nan
It is a flattened hierarchy that describes which components go into a product. Level 1 is the finished good (e.g. pizza) and levels 2, 3, and so on are the ingredients used to make that product. I need to apply the following logic:
df_grouped = df.groupby(by=['level 1'])
levels = [4, 3, 2]  # renamed from `range` to avoid shadowing the builtin
for group in df_grouped:
    for i in levels:
        df[f'qty {i}'] = df[f'qty {i-1}'] * df[f'qty {i}'] / df.groupby(f'level {i-1}')[f'qty {i}'].transform('sum')
Okay, here is what I need to do: if we, for instance, look at level 1 = 1980 and level 2 = 7117, I need to take 2.4 * 15/(15 + 1.3). The same goes for the row below: 2.4 * 1.3/(15 + 1.3).
This needs to be done for each level within each level 1 (product); a sketch of one possible implementation follows the expected output below.
expected output:
level 1  level 2  qty 2  level 3  qty 3          level 4  qty 4
1980     2302     1.2    nan      nan            nan      nan
1980     7117     2.4    10025    2.20858895706  2343     15
1980     7117     2.4    1221     0.19141104294  nan      nan
1870     2333     22     nan      nan            nan      nan
1870     7117     2.1    10025    1.09565217391  nan      nan
1870     7117     2.1    5445     1.00434782609  nan      nan
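For reference, here is a minimal sketch (my own, not from the question) of one way to express this logic. It assumes the columns are literally named 'level 1', 'qty 2', etc., and that for each qty i the denominator is the sum of qty i within the group formed by all parent levels. Running the deepest level first, as the question's loop does, means qty 4 is computed from the original qty 3 (so for the sample data, 15 * 11 / 11 = 15, matching the expected output):

import pandas as pd

# Sketch only: for each qty column, divide by the sum of that column within
# the group formed by all parent levels, then scale by the parent quantity.
for i in [4, 3]:  # deepest level first, mirroring the loop in the question
    parent_levels = [f'level {j}' for j in range(1, i)]  # e.g. ['level 1', 'level 2'] for qty 3
    group_sum = df.groupby(parent_levels)[f'qty {i}'].transform('sum')
    df[f'qty {i}'] = df[f'qty {i-1}'] * df[f'qty {i}'] / group_sum

For the 1980/7117 group this gives qty 3 = 2.4 * 15/16.3 ≈ 2.2086 and 2.4 * 1.3/16.3 ≈ 0.1914, matching the expected output above. Rows whose parent level is nan are left untouched, since groupby drops nan keys by default.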

Related

How to get max of a slice of a dataframe based on column values?

I'm looking to make a new column, MaxPriceBetweenEntries, based on the max() of a slice of the dataframe:
idx Price EntryBar ExitBar
0 10.00 0 1
1 11.00 NaN NaN
2 10.15 2 4
3 12.14 NaN NaN
4 10.30 NaN NaN
turned into
idx Price EntryBar ExitBar MaxPriceBetweenEntries
0 10.00 0 1 11.00
1 11.00 NaN NaN NaN
2 10.15 2 4 12.14
3 12.14 NaN NaN NaN
4 10.30 NaN NaN NaN
I can get all the rows with an EntryBar or ExitBar value with df.loc[df["EntryBar"].notnull()] and df.loc[df["ExitBar"].notnull()], but I can't use that to set a new column:
df.loc[df["EntryBar"].notnull(),"MaxPriceBetweenEntries"] = df.loc[df["EntryBar"]:df["ExitBar"]]["Price"].max()
but that's effectively a guess at this point, because nothing I'm trying works. Ideally the solution wouldn't involve a loop directly because there may be millions of rows.
You can group by the cumulative sum of non-null entries and take the max, using np.where() to apply the result only to non-null rows:
import numpy as np

df['MaxPriceBetweenEntries'] = np.where(df['EntryBar'].notnull(),
                                        df.groupby(df['EntryBar'].notnull().cumsum())['Price'].transform('max'),
                                        np.nan)
df
Out[1]:
idx Price EntryBar ExitBar MaxPriceBetweenEntries
0 0 10.00 0.0 1.0 11.00
1 1 11.00 NaN NaN NaN
2 2 10.15 2.0 4.0 12.14
3 3 12.14 NaN NaN NaN
4 4 10.30 NaN NaN NaN
Let's try groupby() and where:
s = df['EntryBar'].notna()
df['MaxPriceBetweenEntries'] = df.groupby(s.cumsum())['Price'].transform('max').where(s)
Output:
idx Price EntryBar ExitBar MaxPriceBetweenEntries
0 0 10.00 0.0 1.0 11.00
1 1 11.00 NaN NaN NaN
2 2 10.15 2.0 4.0 12.14
3 3 12.14 NaN NaN NaN
4 4 10.30 NaN NaN NaN
You can forward fill the null values, group by entry, and take the max of each group's Price. Use that as the right side of a left join and you should be in business.
df.merge(df.ffill().groupby('EntryBar')['Price'].max().reset_index(name='MaxPriceBetweenEntries'),
         on='EntryBar',
         how='left')
Try
df.loc[df['ExitBar'].notna(), 'Max'] = df.groupby(df['ExitBar'].ffill()).Price.max().values
df
Out[74]:
idx Price EntryBar ExitBar Max
0 0 10.00 0.0 1.0 11.00
1 1 11.00 NaN NaN NaN
2 2 10.15 2.0 4.0 12.14
3 3 12.14 NaN NaN NaN
4 4 10.30 NaN NaN NaN

When I add a grouping function to create a new column in a DF, it's not working as expected

Group by result
empdf.groupby('deptno')['sal'].max()
deptno
10 5000.0
20 3000.0
30 2850.0
I joined this result to my DataFrame empdf, but the expected result is not coming through. Below are the query and the result.
empdf.assign(maxsal_dept = empdf.groupby('deptno')['sal'].max())
   empno  ename  job        mgr     hiredate             sal     comm   deptno  totalsal  rnk  dnsrnk  maxsal_dept
0   7839  KING   PRESIDENT     NaN  1981-11-17 00:00:00  5000.0   50.0      10    5050.0    1       1          NaN
1   7698  BLAKE  MANAGER    7839.0  1981-05-01 00:00:00  2850.0  285.0      30    3135.0    5       4          NaN
2   7782  CLARK  MANAGER    7839.0  1981-06-09 00:00:00  2450.0   24.5      10    2474.5    6       5          NaN
3   7566  JONES  MANAGER    7839.0  1981-04-02 00:00:00  2975.0    NaN      20    2975.0    4       3          NaN
4   7788  SCOTT  ANALYST    7566.0  1987-04-19 00:00:00  3000.0    NaN      20    3000.0    2       2          NaN
5   7902  FORD   ANALYST    7566.0  1981-12-03 00:00:00  3000.0    NaN      20    3000.0    3       2          NaN
6   7369  SMITH  CLERK      7902.0  1980-12-17 00:00:00   800.0    NaN      20     800.0   14      12          NaN
I want to join this grouped result to the DF as a new column, but it's not giving the right result: the maxsal_dept column (highlighted in yellow in my original screenshot) comes out as all NaN.
You have to use transform
Try the below snippet
empdf['maxsal_dept'] = empdf.groupby('deptno')['sal'].transform('max')
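For context on why the original assignment produced NaN: empdf.groupby('deptno')['sal'].max() returns a Series indexed by deptno (10, 20, 30), and assign aligns that Series against empdf's row index (0, 1, 2, ...), so nothing matches and the new column fills with NaN. transform('max') instead returns a Series aligned to the original row index, one value per row, so the assignment works.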

Dropping a range of rows in a pandas DataFrame creates a KeyError

I have several different data frames that I need to drop certain rows from. Each data frame has the same sequence of rows, but located in different areas.
Summary Results Report Unnamed: 0 Unnamed: 1 Unnamed: 2 Unnamed: 3
0 DEM President NaN NaN NaN NaN
1 Vote For 1 NaN NaN NaN NaN
2 NaN NaN Ballots By NaN Election
3 TOTAL NaN NaN Early Voting NaN
4 NaN NaN Mail NaN Day
5 Tom Steyer NaN 0 0 0 0
6 Andrew Yang NaN 0 0 0 0
7 John K. Delaney NaN 0 0 0 0
8 Cory Booker NaN 0 0 0 0
9 Michael R. Bloomberg NaN 4 1 1 2
10 Julian Castro NaN 0 0 0 0
11 Elizabeth Warren NaN 1 0 1 0
12 Marianne Williamson NaN 0 0 0 0
13 Deval Patrick NaN 0 0 0 0
14 Robby Wells NaN 0 0 0 0
15 Amy Klobuchar NaN 3 1 2 0
16 Tulsi Gabbard NaN 0 0 0 0
17 Michael Bennet NaN 0 0 0 0
18 Bernie Sanders NaN 4 0 1 3
19 Pete Buttigieg NaN 0 0 0 0
20 Joseph R. Biden 21.0 0 3 18
21 Roque "Rocky" De La Fuente NaN 0 0 0 0
22 Total Votes Cast 33.0 2 8 23
Summary Results Report Unnamed: 0 Unnamed: 1 Unnamed: 2 Unnamed: 3 Unnamed: 4 Unnamed: 5
0 DEM US Senator NaN NaN NaN NaN NaN NaN
1 Vote For 1 NaN NaN NaN NaN NaN NaN
2 NaN NaN NaN Ballots By NaN Election NaN
3 TOTAL NaN NaN NaN Early Voting NaN NaN
4 NaN NaN NaN Mail NaN Day NaN
5 Jack Daniel Foster, Jr. 4.0 NaN 0 0 4 NaN
6 Mary (MJ) Hegar 4.0 NaN 1 3 0 NaN
7 Amanda K. Edwards 4.0 NaN 1 1 2 NaN
8 D. R. Hunter 1.0 NaN 0 0 1 NaN
9 Michael Cooper 3.0 NaN 0 0 3 NaN
10 Chris Bell 1.0 NaN 0 0 1 NaN
11 Royce West 3.0 NaN 0 0 3 NaN
12 Cristina Tzintzun Ramirez 5.0 NaN 0 3 2 NaN
13 Victor Hugo Harris 1.0 NaN 0 0 1 NaN
14 Sema Hernandez 1.0 NaN 0 0 1 NaN
15 Adrian Ocegueda 0.0 NaN 0 0 0 NaN
16 Annie "Mama" Garcia 3.0 NaN 0 1 2 NaN
17 Total Votes Cast 30 NaN NaN 2 8 20 NaN
18 DEM US Representative, Dist 1 NaN NaN NaN NaN NaN NaN
19 Vote For 1 NaN NaN NaN NaN NaN NaN
20 NaN NaN NaN Ballots By NaN Election NaN
21 TOTAL NaN NaN NaN Early Voting NaN NaN
22 NaN NaN NaN Mail NaN Day NaN
23 Hank Gilbert 26 NaN NaN 1 6 19 NaN
24 Total Votes Cast 26 NaN NaN 1 6 19 NaN
What I want to remove is each row that contains 'Vote For 1' in the first column, as well as the following 3 rows. The problem is that they can show up in multiple areas, and on occasion multiple times (as in the second data frame). What I have seems to be working, in the sense that it removes the required rows; however, at the end it gives me a key error, which tells me that it is re-looping through without any data.
for x in range(len(df)):
    if 'Vote For 1' in str(df.iloc[:, 0][x]):
        y = x + 3
        df = df.drop(df.loc[x:y].index)
        df.reset_index(drop=True, inplace=True)
        df.index.name = None
print(df)
the code produces the following output:
Summary Results Report Unnamed: 0 Unnamed: 1 Unnamed: 2 Unnamed: 3 Unnamed: 4 Unnamed: 5
0 DEM US Senator NaN NaN NaN NaN NaN NaN
1 Jack Daniel Foster, Jr. 4.0 NaN 0 0 4 NaN
2 Mary (MJ) Hegar 4.0 NaN 1 3 0 NaN
3 Amanda K. Edwards 4.0 NaN 1 1 2 NaN
4 D. R. Hunter 1.0 NaN 0 0 1 NaN
5 Michael Cooper 3.0 NaN 0 0 3 NaN
6 Chris Bell 1.0 NaN 0 0 1 NaN
7 Royce West 3.0 NaN 0 0 3 NaN
8 Cristina Tzintzun Ramirez 5.0 NaN 0 3 2 NaN
9 Victor Hugo Harris 1.0 NaN 0 0 1 NaN
10 Sema Hernandez 1.0 NaN 0 0 1 NaN
11 Adrian Ocegueda 0.0 NaN 0 0 0 NaN
12 Annie "Mama" Garcia 3.0 NaN 0 1 2 NaN
13 Total Votes Cast 30 NaN NaN 2 8 20 NaN
14 DEM US Representative, Dist 1 NaN NaN NaN NaN NaN NaN
15 Hank Gilbert 26 NaN NaN 1 6 19 NaN
16 Total Votes Cast 26 NaN NaN 1 6 19 NaN
It errors out at the end with KeyError: 17. Any advice is greatly appreciated.
####EDIT####
I just wanted to give an update on the code that finally solved my problem. I know it is probably a little rough, but it does work.
remove_strings = ['Vote For 1', 'TOTAL']
remove_strings_list = df.index[df['Summary Results Report'].isin(remove_strings)].tolist()
df = df.drop(df.index[remove_strings_list])
Not exactly sure what your column names are, but if the summary column contains the names and the few values you want to remove, this should work. Otherwise you may have to change the column name accordingly.
strings_to_remove = ['Vote for 1', 'Total', 'NaN']
df[~df.summary.isin(strings_to_remove)]
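As for the KeyError itself: the loop iterates over range(len(df)) computed from the original length, but every drop followed by reset_index shrinks the frame, so x eventually points past the last remaining row (hence KeyError: 17). A loop-free sketch of my own (assuming the default integer index, as in the question) that marks each 'Vote For 1' row plus the three rows after it using shifted boolean masks:

# Sketch: drop every row whose first column contains 'Vote For 1',
# plus the three rows that follow it, without an explicit row loop.
hit = df.iloc[:, 0].astype(str).str.contains('Vote For 1', na=False)

# A row is dropped if it is a hit, or if a hit occurred 1-3 rows above it.
drop_mask = hit
for k in (1, 2, 3):
    drop_mask = drop_mask | hit.shift(k, fill_value=False)

df = df[~drop_mask].reset_index(drop=True)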

Pandas pd.merge gives nan

I have two dataframes which I need to merge/join based on a column. When I try to join/merge them, the new columns come back as NaN.
Basically, I need to perform Left Join on the dataframes, considering df_user as the dataframe on the Left.
PS: The join column has the same datatype in both dataframes.
Please find the dataframes below -
df_user.dtypes
App category
Sentiment int8
Sentiment_Polarity float64
Sentiment_Subjectivity float64
df_play.dtypes
App category
Category category
Rating float64
Reviews float64
Size float64
Installs int64
Type int8
Price float64
Content Rating int8
Installs_Cat int8
df_play.head()
App Category Rating Reviews Size Installs Type Price Content Installs_Cat
0 SPrapBook ART_AND_DESIGN 4.1 159 19 10000 0 0 0 9
1 U Launcher ART_AND_DESIGN 4.5 87510 25 5000000 0 0 0 14
2 Sketch - ART_AND_DESIGN 4.3 215644 2.8 50000000 0 0 1 16
3 Pixel Dra ART_AND_DESIGN 4.4 967 5.6 100000 0 0 0 11
4 Paper flo ART_AND_DESIGN 3.8 167 19 50000 0 0 0 10
df_user.head()
App Sentiment Sentiment_Polarity Sentiment_Subjectivity
0 10 Best Foods for You 2 1.00 0.533333
1 10 Best Foods for You 2 0.25 0.288462
3 10 Best Foods for You 2 0.40 0.875000
4 10 Best Foods for You 2 1.00 0.300000
5 10 Best Foods for You 2 1.00 0.300000
I tried both of the snippets below:
result = pd.merge(df_user, df_play, how='left', on='App')
result = df_user.join(df_play.set_index('App'),on='App',how='left',rsuffix='_y')
But all I got was:
App Sentiment Sentiment_Polarity Sentiment_Subjectivity Category Rating Reviews Size Installs Type Price Content Rating Installs_Cat
0 10 Best Foods for You 2 1.00 0.533333 NaN NaN NaN NaN NaN NaN NaN NaN NaN
1 10 Best Foods for You 2 0.25 0.288462 NaN NaN NaN NaN NaN NaN NaN NaN NaN
2 10 Best Foods for You 2 0.40 0.875000 NaN NaN NaN NaN NaN NaN NaN NaN NaN
3 10 Best Foods for You 2 1.00 0.300000 NaN NaN NaN NaN NaN NaN NaN NaN NaN
4 10 Best Foods for You 2 1.00 0.300000 NaN NaN NaN NaN NaN NaN NaN NaN NaN
Please excuse me for the formatting.
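No answer was recorded here, but a first diagnostic step (a sketch, assuming the frames as shown above) is to check whether the 'App' keys overlap at all; an all-NaN right side after a left merge almost always means zero matching keys, commonly caused by stray whitespace or by two category dtypes with different categories:

import pandas as pd

# How many of df_user's apps actually appear in df_play?
print(df_user['App'].isin(df_play['App']).sum(), 'of', len(df_user))

# Common fix: normalize the keys to plain, stripped strings before merging.
df_user['App'] = df_user['App'].astype(str).str.strip()
df_play['App'] = df_play['App'].astype(str).str.strip()
result = pd.merge(df_user, df_play, how='left', on='App')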

computing correlation between values of one column

I have a huge dataframe that looks like this:
gemeente Partij Perioden Bevolking/Bevolkingssamenstelling op 1 januari/Totale bevolking (aantal) Bevolking/Bevolkingssamenstelling op 1 januari/Geslacht/Mannen (aantal) Bevolking/Bevolkingssamenstelling op 1 januari/Geslacht/Vrouwen (aantal) Bevolking/Bevolkingssamenstelling op 1 januari/Leeftijd/Leeftijdsgroepen/Jonger dan 5 jaar (aantal) Bevolking/Bevolkingssamenstelling op 1 januari/Leeftijd/Leeftijdsgroepen/5 tot 10 jaar (aantal) Bevolking/Bevolkingssamenstelling op 1 januari/Leeftijd/Leeftijdsgroepen/10 tot 15 jaar (aantal) Bevolking/Bevolkingssamenstelling op 1 januari/Leeftijd/Leeftijdsgroepen/15 tot 20 jaar (aantal)
0 's-Hertogenbosch VVD 2007 135648.0 66669.0 68979.0 7986.0 7809.0 7514.0 7612.0 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
1 's-Hertogenbosch VVD 2008 136481.0 67047.0 69434.0 7885.0 7853.0 7517.0 7680.0 ... 5.8 8.6 41.3 5.2 4.0 20.0 4.0 5.0 25.0 3.0
2 's-Hertogenbosch VVD 2009 137775.0 67715.0 70060.0 7915.0 7890.0 7497.0 7628.0 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
3 's-Hertogenbosch VVD 2010 139607.0 68628.0 70979.0 8127.0 7852.0 7527.0 7752.0 ... 5.6 8.4 40.7 5.4 4.0 20.0 3.0 5.0 24.0 3.0
4 Aa en Hunze PVDA 2007 25563.0 12653.0 12910.0
Partij consists of 6 possible labels, and I have 270 columns.
I want to compute the correlation and/or similarity between those 6 Partij labels using the data from those 270 columns.
I tried pd.groupby, but that only gives me correlations between columns and not between parties.
I tried to make a pd.pivot_table with Partij as the column names, but then I still had all the original column names and couldn't access the Partij labels to compute correlations.
You can make Partij values appear as columns by using the transpose method of pandas' DataFrame:
df = df.transpose()
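Building on that, a minimal sketch (my own, assuming the 270 feature columns are numeric and that a per-party mean is an acceptable aggregate): aggregate per Partij, transpose so each party becomes a column, then call .corr():

# Mean feature vector per party: 6 rows x ~270 numeric columns.
numeric_cols = df.select_dtypes('number').columns
per_party = df.groupby('Partij')[numeric_cols].mean()

# Transpose so each Partij is a column, then correlate those columns.
party_corr = per_party.transpose().corr()
print(party_corr)  # 6 x 6 correlation matrix between parties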
