How to sum up a column based on another column's value in Python

I have this example df:
import pandas as pd

col1 = [1,1,1,2,2,1,1,1,2,2,2]
col2 = [20, 23, 12, 44, 14, 42, 44, 1, 42, 62, 11]
data = {"col1": col1, "col2": col2}
df = pd.DataFrame(data)
I need to add a column that sums col2 over each consecutive run of the same value in col1 (each run of 1s, then each run of 2s, and so on). I have tried grouping by col1, but that merges separate runs of 1s that have 2s in between.
The expected output would be this.
col1 col2 col3
1 20 55
1 23 55
1 12 55
2 44 58
2 14 58
1 42 87
1 44 87
1 1 87
2 42 115
2 62 115
2 11 115
Please let me know how to approach this.

Use GroupBy.transform with a helper Series that labels consecutive runs, generated by comparing shifted values for inequality and taking the cumulative sum:
df['col3'] = df.groupby(df['col1'].ne(df['col1'].shift()).cumsum())['col2'].transform('sum')
print (df)
col1 col2 col3
0 1 20 55
1 1 23 55
2 1 12 55
3 2 44 58
4 2 14 58
5 1 42 87
6 1 44 87
7 1 1 87
8 2 42 115
9 2 62 115
10 2 11 115
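To see why this works, here is the helper Series on its own (a minimal sketch): each consecutive run of equal values in col1 gets its own integer label, so the groupby sums within runs instead of across them.
# True wherever col1 differs from the previous row
change = df['col1'].ne(df['col1'].shift())
# The cumulative sum turns the change points into run labels
run_id = change.cumsum()
print(run_id.tolist())  # [1, 1, 1, 2, 2, 3, 3, 3, 4, 4, 4]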

You can do this by creating a column that marks every change in col1 and then summing with groupby:
i = df.col1
df['Var3'] = i.ne(i.shift()).cumsum()
df['sums'] = df.groupby(['Var3'])['col2'].transform('sum')
which gives
col1 col2 Var3 sums
0 1 20 1 55
1 1 23 1 55
2 1 12 1 55
3 2 44 2 58
4 2 14 2 58
5 1 42 3 87
6 1 44 3 87
7 1 1 3 87
8 2 42 4 115
9 2 62 4 115
10 2 11 4 115
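If you do not want to keep the helper column afterwards, it can be dropped once the sums are in place:
df = df.drop(columns=['Var3'])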


Sum row values of all columns where column names meet string match condition

I have the following dataset:
ID Length Width Range_CAP Capacity_CAP
0 1 33 25 16 50
1 2 34 22 11 66
2 3 22 12 15 42
3 4 46 45 66 54
4 5 16 6 23 75
5 6 21 42 433 50
I basically want to sum the row values of only those columns whose names match a string (in this case, all columns ending in _CAP), and store the result in a new column.
So that I end up with a dataframe that looks something like this:
ID Length Width Range_CAP Capacity_CAP CAP_SUM
0 1 33 25 16 50 66
1 2 34 22 11 66 77
2 3 22 12 15 42 57
3 4 46 45 66 54 120
4 5 16 6 23 75 98
5 6 21 42 433 50 483
I first tried to use the solution recommended in this question here:
Summing columns in Dataframe that have matching column headers
However, that solution doesn't work for me: it sums columns that share the exact same name, so a simple groupby can accomplish the result there, whereas I am trying to sum columns that merely match a string.
Code to recreate above sample dataset:
data1 = [['1', 33, 25, 16, 50], ['2', 34, 22, 11, 66],
         ['3', 22, 12, 15, 42], ['4', 46, 45, 66, 54],
         ['5', 16, 6, 23, 75], ['6', 21, 42, 433, 50]]
df = pd.DataFrame(data1, columns=['ID', 'Length', 'Width', 'Range_CAP', 'Capacity_CAP'])
Let us use filter:
df['CAP_SUM'] = df.filter(like='CAP').sum(1)
Out[86]:
0 66
1 77
2 57
3 120
4 98
5 483
dtype: int64
If other columns contain CAP elsewhere in their names, anchor the match at the end with a regex:
df.filter(regex='_CAP$').sum(1)
Out[92]:
0 66
1 77
2 57
3 120
4 98
5 483
dtype: int64
One approach is:
df['CAP_SUM'] = df.loc[:, df.columns.str.endswith('_CAP')].sum(1)
print(df)
Output
ID Length Width Range_CAP Capacity_CAP CAP_SUM
0 1 33 25 16 50 66
1 2 34 22 11 66 77
2 3 22 12 15 42 57
3 4 46 45 66 54 120
4 5 16 6 23 75 98
5 6 21 42 433 50 483
The expression:
df.columns.str.endswith('_CAP')
creates a boolean mask whose values are True if and only if the column name ends with '_CAP'. As an alternative, use filter with the following regex:
df['CAP_SUM'] = df.filter(regex='_CAP$').sum(1)
print(df)
Output (of filter)
ID Length Width Range_CAP Capacity_CAP CAP_SUM
0 1 33 25 16 50 66
1 2 34 22 11 66 77
2 3 22 12 15 42 57
3 4 46 45 66 54 120
4 5 16 6 23 75 98
5 6 21 42 433 50 483
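For reference, the mask itself looks like this on the sample frame, before CAP_SUM is added (a minimal sketch):
print(df.columns.str.endswith('_CAP'))
# [False False False  True  True]  -> ID, Length and Width are excluded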
You may try this:
columnstxt = df.columns
df['sum'] = 0
for i in columnstxt:
    if i.find('_CAP') != -1:
        df['sum'] = df['sum'] + df[i]
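The same idea in a more concise form; a sketch assuming any column containing '_CAP' should be included:
cap_cols = [c for c in df.columns if '_CAP' in c]
df['sum'] = df[cap_cols].sum(axis=1)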

Multiply dataframe rows depending on a value in the row

I have a dataframe like this:
df = pd.DataFrame({'col1': [69, 77, 88],
                   'col2': ['bar34', 'barf30', 'barfoo29'],
                   'col3': [4, 2, 5]})
print(df, '\n')
col1 col2 col3
0 69 bar34 4
1 77 barf30 2
2 88 barfoo29 5
I need to repeat each row according to the value in 'col3'. Desired output:
col1 col2 col3
0 69 bar34 4
1 69 bar34 4
2 69 bar34 4
3 69 bar34 4
4 77 barf30 2
5 77 barf30 2
6 88 barfoo29 5
7 88 barfoo29 5
8 88 barfoo29 5
9 88 barfoo29 5
10 88 barfoo29 5
Here is a solution: Index.repeat, then DataFrame.reindex:
df = df.set_index(df.col3)
print(
    df.reindex(df.index.repeat(df.col3))
      .reset_index(drop=True)
)
# shorter version suggested by @anky:
df.loc[df.index.repeat(df.col3)]
col1 col2 col3
0 69 bar34 4
1 69 bar34 4
2 69 bar34 4
3 69 bar34 4
4 77 barf30 2
5 77 barf30 2
6 88 barfoo29 5
7 88 barfoo29 5
8 88 barfoo29 5
9 88 barfoo29 5
10 88 barfoo29 5
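Starting from the original df (before any set_index), the @anky one-liner plus a reset produces the same output:
out = df.loc[df.index.repeat(df.col3)].reset_index(drop=True)
print(out)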
I have only one solution, but I'm pretty sure it's not efficient at all.
import numpy as np

# Get the column names
cols = df.columns.to_list()
# Collect each row repeated col3 times
full_array = []
for index, row in df.iterrows():
    for _ in range(row['col3']):
        full_array.append([row[col_name] for col_name in cols])
# Build a new dataframe from the repeated rows
df = pd.DataFrame(np.array(full_array), columns=cols)
print(df)
col1 col2 col3
0 69 bar34 4
1 69 bar34 4
2 69 bar34 4
3 69 bar34 4
4 77 barf30 2
5 77 barf30 2
6 88 barfoo29 5
7 88 barfoo29 5
8 88 barfoo29 5
9 88 barfoo29 5
10 88 barfoo29 5

How to generate a pandas dataframe from a list with a condition

I have the following list in Python:
movie_list = [11, 21, 31, 41, 51, 62, 55]
and the following movie dataframe:
userId movieId
1 11
1 21
1 31
2 62
2 55
Now, for each userId, I want to generate a similar dataframe containing the movieIds that are in movie_list but not already in the dataframe for that user.
My desired dataframe would be
userId movieId
1 41
1 51
1 62
1 55
2 11
2 21
2 31
2 41
2 51
How can I do it in pandas?
IIUC, we can agg with list, then take the difference between movie_list and the original values in df per user:
s = df.groupby('userId').movieId.agg(list).\
      map(lambda x: list(set(movie_list) - set(x))).explode().reset_index()
print(s)
userId movieId
0 1 41
1 1 51
2 1 62
3 1 55
4 2 41
5 2 11
6 2 51
7 2 21
8 2 31
One approach would be to use itertools.product to create all combinations of userId & movieId, then concat and drop_duplicates with keep=False: every existing pair appears in both frames and is therefore dropped, leaving only the missing pairs.
from itertools import product
movie_list = [11, 21, 31, 41, 51, 62, 55]
df_all = pd.DataFrame(product(df['userId'].unique(), movie_list), columns=df.columns)
df2 = pd.concat([df, df_all]).drop_duplicates(keep=False)
print(df2)
[out]
userId movieId
3 1 41
4 1 51
5 1 62
6 1 55
7 2 11
8 2 21
9 2 31
10 2 41
11 2 51
prod = pd.MultiIndex.from_product([df.userId.unique().tolist(), movie_list]).tolist()
(
    pd.DataFrame(set(prod).difference([tuple(e) for e in df.values]),
                 columns=['userId', 'movieId'])
    .sort_values(by=['userId', 'movieId'])
)
userId movieId
7 1 41
6 1 51
2 1 55
8 1 62
5 2 11
4 2 21
3 2 31
1 2 41
0 2 51
I think you need:
df = df.groupby("userId")["movieId"].apply(list).reset_index()
df["movieId"] = df["movieId"].apply(lambda x: list(set(movie_list)-set(x)))
df = df.explode("movieId")
print(df)
Output:
userId movieId
0 1 41
0 1 51
0 1 62
0 1 55
1 2 41
1 2 11
1 2 51
1 2 21
1 2 31
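Another option, not shown above, is an anti-join via merge with indicator=True; a sketch using the same df and movie_list:
from itertools import product

# All (userId, movieId) combinations
all_pairs = pd.DataFrame(product(df['userId'].unique(), movie_list),
                         columns=['userId', 'movieId'])
# Left-merge against the existing pairs and keep only the ones not found
merged = all_pairs.merge(df, on=['userId', 'movieId'], how='left', indicator=True)
result = merged.loc[merged['_merge'] == 'left_only', ['userId', 'movieId']]
print(result)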

Insert a dataframe into a dataframe - Python/Pandas

The question is pretty self-explanatory: how would you insert a dataframe with a couple of values into a bigger dataframe at a given point (between indexes 10 and 11)? This means .append can't be used.
You can use concat with the dataframe sliced by loc:
import numpy as np

np.random.seed(100)
df1 = pd.DataFrame(np.random.randint(100, size=(5,6)), columns=list('ABCDEF'))
print (df1)
A B C D E F
0 8 24 67 87 79 48
1 10 94 52 98 53 66
2 98 14 34 24 15 60
3 58 16 9 93 86 2
4 27 4 31 1 13 83
df2 = pd.DataFrame({'A':[1,2,3],
                    'B':[4,5,6],
                    'C':[7,8,9],
                    'D':[1,3,5],
                    'E':[5,3,6],
                    'F':[7,4,3]})
print (df2)
A B C D E F
0 1 4 7 1 5 7
1 2 5 8 3 3 4
2 3 6 9 5 6 3
#inserted between index values 4 and 5; note that loc slicing is inclusive at
#both ends, so row 4 appears on both sides of the insert here (use df1.loc[:3]
#and df1.loc[4:] to insert between rows 3 and 4 without the duplicate)
print (pd.concat([df1.loc[:4], df2, df1.loc[4:]], ignore_index=True))
A B C D E F
0 8 24 67 87 79 48
1 10 94 52 98 53 66
2 98 14 34 24 15 60
3 58 16 9 93 86 2
4 27 4 31 1 13 83
5 1 4 7 1 5 7
6 2 5 8 3 3 4
7 3 6 9 5 6 3
8 27 4 31 1 13 83
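The pattern generalizes to a small helper; a sketch with a hypothetical name insert_frame, inserting df_new after positional row pos (iloc slicing is exclusive at the stop, so nothing is duplicated):
def insert_frame(df, df_new, pos):
    # Split by position rather than label
    top = df.iloc[:pos + 1]
    bottom = df.iloc[pos + 1:]
    return pd.concat([top, df_new, bottom], ignore_index=True)

# e.g. insert df2 between rows 10 and 11 of a bigger frame:
# result = insert_frame(big_df, df2, 10)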

Pandas: compute numerous columns of percentage values

I'm failing to loop through the values of select dataframe columns in order to create new columns representing percentage values. Reproducible example:
data = {'Respondents': [90, 43, 89, '89', '67', '88', '73', '78', '62', '101'],
        'answer_1': [51, 15, 15, 61, 16, 14, 15, 1, 0, 16],
        'answer_2': [11, 12, 14, 40, 36, 78, 12, 0, 26, 78],
        'answer_3': [3, 8, 4, 0, 2, 7, 10, 11, 6, 7]}
df = pd.DataFrame(data)
df
Respondents answer_1 answer_2 answer_3
0 90 51 11 3
1 43 15 12 8
2 89 15 14 4
3 89 61 40 0
4 67 16 36 2
5 88 14 78 7
6 73 15 12 10
7 78 1 0 11
8 62 0 26 6
9 101 16 78 7
The aim is to compute the percentage for each of the answer columns against the total respondents. For example, for the new answer_1 column – let's name it answer_1_perc – the first value would be 57 (because 51 is about 57% of 90), the next value would be 35 (15 is about 35% of 43). Then there would be answer_2_perc and answer_3_perc columns.
I've written so many iterations of the following code my head's spinning.
for columns in df.iloc[:, 1:4]:
    for i in columns:
        i_name = 'percentage_' + str(columns)
        i_group = ([i] / df['Respondents'] * 100)
        df[i_name] = i_group
What is the best way to do this? I need to use an iterative method as my actual data has 25 answer columns rather than the 3 shown in this example.
You almost had it; note that you have string values in the Respondents col, which I've corrected prior to calling the following:
In [172]:
df['Respondents'] = df['Respondents'].astype(int)  # correct the string values first
for col in df.columns[1:4]:
    i_name = 'percentage_' + col
    i_group = (df[col] / df['Respondents']) * 100
    df[i_name] = i_group
df
df
Out[172]:
Respondents answer_1 answer_2 answer_3 percentage_answer_1 \
0 90 51 11 3 56.666667
1 43 15 12 8 34.883721
2 89 15 14 4 16.853933
3 89 61 40 0 68.539326
4 67 16 36 2 23.880597
5 88 14 78 7 15.909091
6 73 15 12 10 20.547945
7 78 1 0 11 1.282051
8 62 0 26 6 0.000000
9 101 16 78 7 15.841584
percentage_answer_2 percentage_answer_3
0 12.222222 3.333333
1 27.906977 18.604651
2 15.730337 4.494382
3 44.943820 0.000000
4 53.731343 2.985075
5 88.636364 7.954545
6 16.438356 13.698630
7 0.000000 14.102564
8 41.935484 9.677419
9 77.227723 6.930693
I recommend using div and concat:
df['Respondents'] = df['Respondents'].astype(float)
df_pct = (df.drop('Respondents', axis=1)
            .div(df['Respondents'], axis=0)
            .mul(100)
            .rename(columns=lambda col: 'percentage_' + col))
pd.concat([df, df_pct], axis=1)
Respondents answer_1 answer_2 answer_3 percentage_answer_1 \
0 90.0 51 11 3 56.666667
1 43.0 15 12 8 34.883721
2 89.0 15 14 4 16.853933
3 89.0 61 40 0 68.539326
4 67.0 16 36 2 23.880597
5 88.0 14 78 7 15.909091
6 73.0 15 12 10 20.547945
7 78.0 1 0 11 1.282051
8 62.0 0 26 6 0.000000
9 101.0 16 78 7 15.841584
percentage_answer_2 percentage_answer_3
0 12.222222 3.333333
1 27.906977 18.604651
2 15.730337 4.494382
3 44.943820 0.000000
4 53.731343 2.985075
5 88.636364 7.954545
6 16.438356 13.698630
7 0.000000 14.102564
8 41.935484 9.677419
9 77.227723 6.930693
Another solution: div the desired columns by the Respondents column and assign the result to new column names (.ix is removed in modern pandas, so use .iloc):
print ('percentage_' + df.columns[1:4])
Index(['percentage_answer_1', 'percentage_answer_2', 'percentage_answer_3'], dtype='object')
df['percentage_' + df.columns[1:4]] = df.iloc[:, 1:4].div(df.Respondents, axis=0) * 100
print (df)
Respondents answer_1 answer_2 answer_3 percentage_answer_1 \
0 90 51 11 3 56.666667
1 43 15 12 8 34.883721
2 89 15 14 4 16.853933
3 89 61 40 0 68.539326
4 67 16 36 2 23.880597
5 88 14 78 7 15.909091
6 73 15 12 10 20.547945
7 78 1 0 11 1.282051
8 62 0 26 6 0.000000
9 101 16 78 7 15.841584
percentage_answer_2 percentage_answer_3
0 12.222222 3.333333
1 27.906977 18.604651
2 15.730337 4.494382
3 44.943820 0.000000
4 53.731343 2.985075
5 88.636364 7.954545
6 16.438356 13.698630
7 0.000000 14.102564
8 41.935484 9.677419
9 77.227723 6.930693
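Since the real data has 25 answer columns, selecting them by name prefix avoids hard-coded positions; a minimal sketch, assuming the columns all start with 'answer_':
answer_cols = [c for c in df.columns if c.startswith('answer_')]
df['Respondents'] = pd.to_numeric(df['Respondents'])  # guard against string values
for c in answer_cols:
    df['percentage_' + c] = df[c] / df['Respondents'] * 100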
