How to sum up a column based on another column's value in Python

I have this example df:
import pandas as pd

col1 = [1,1,1,2,2,1,1,1,2,2,2]
col2 = [20, 23, 12, 44, 14, 42, 44, 1, 42, 62, 11]
data = {"col1": col1, "col2": col2}
df = pd.DataFrame(data)
I need to add a column that sums col2 over each consecutive run of the same value in col1 (each run of 1s, then each run of 2s, and so on). I have tried grouping by col1, but that merges separate runs of 1s that have 2s in between.
The expected output would be this.
col1 col2 col3
1 20 55
1 23 55
1 12 55
2 44 58
2 14 58
1 42 87
1 44 87
1 1 87
2 42 115
2 62 115
2 11 115
Please let me know how to approach this.

Use GroupBy.transform with a helper Series that labels consecutive runs, generated by comparing shifted values for inequality and taking the cumulative sum:
df['col3'] = df.groupby(df['col1'].ne(df['col1'].shift()).cumsum())['col2'].transform('sum')
print (df)
col1 col2 col3
0 1 20 55
1 1 23 55
2 1 12 55
3 2 44 58
4 2 14 58
5 1 42 87
6 1 44 87
7 1 1 87
8 2 42 115
9 2 62 115
10 2 11 115
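To see why this works, here is the helper Series on its own (a minimal sketch): each consecutive run of equal values in col1 gets its own integer label, so the groupby sums within runs instead of across them.
# True wherever col1 differs from the previous row
change = df['col1'].ne(df['col1'].shift())
# The cumulative sum turns the change points into run labels
run_id = change.cumsum()
print(run_id.tolist())  # [1, 1, 1, 2, 2, 3, 3, 3, 4, 4, 4]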

You can do this by creating a column that marks every change in col1 and then summing with groupby:
i = df.col1
df['Var3'] = i.ne(i.shift()).cumsum()
df['sums'] = df.groupby(['Var3'])['col2'].transform('sum')
which gives
col1 col2 Var3 sums
0 1 20 1 55
1 1 23 1 55
2 1 12 1 55
3 2 44 2 58
4 2 14 2 58
5 1 42 3 87
6 1 44 3 87
7 1 1 3 87
8 2 42 4 115
9 2 62 4 115
10 2 11 4 115
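If you do not want to keep the helper column afterwards, it can be dropped once the sums are in place:
df = df.drop(columns=['Var3'])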


Sum row values of all columns where column names meet string match condition

I have the following dataset:
ID Length Width Range_CAP Capacity_CAP
0 1 33 25 16 50
1 2 34 22 11 66
2 3 22 12 15 42
3 4 46 45 66 54
4 5 16 6 23 75
5 6 21 42 433 50
I basically want to sum the row values of only those columns whose names match a string (in this case, all columns ending in _CAP), and store the result in a new column.
So that I end up with a dataframe that looks something like this:
ID Length Width Range_CAP Capacity_CAP CAP_SUM
0 1 33 25 16 50 66
1 2 34 22 11 66 77
2 3 22 12 15 42 57
3 4 46 45 66 54 120
4 5 16 6 23 75 98
5 6 21 42 433 50 483
I first tried to use the solution recommended in this question here:
Summing columns in Dataframe that have matching column headers
However, that solution doesn't work for me: it sums columns that share the exact same name, so a simple groupby can accomplish the result there, whereas I am trying to sum columns that merely match a string.
Code to recreate above sample dataset:
data1 = [['1', 33, 25, 16, 50], ['2', 34, 22, 11, 66],
         ['3', 22, 12, 15, 42], ['4', 46, 45, 66, 54],
         ['5', 16, 6, 23, 75], ['6', 21, 42, 433, 50]]
df = pd.DataFrame(data1, columns=['ID', 'Length', 'Width', 'Range_CAP', 'Capacity_CAP'])
Let us use filter:
df['CAP_SUM'] = df.filter(like='CAP').sum(1)
Out[86]:
0 66
1 77
2 57
3 120
4 98
5 483
dtype: int64
If other columns contain CAP elsewhere in their names, anchor the match at the end with a regex:
df.filter(regex='_CAP$').sum(1)
Out[92]:
0 66
1 77
2 57
3 120
4 98
5 483
dtype: int64
One approach is:
df['CAP_SUM'] = df.loc[:, df.columns.str.endswith('_CAP')].sum(1)
print(df)
Output
ID Length Width Range_CAP Capacity_CAP CAP_SUM
0 1 33 25 16 50 66
1 2 34 22 11 66 77
2 3 22 12 15 42 57
3 4 46 45 66 54 120
4 5 16 6 23 75 98
5 6 21 42 433 50 483
The expression:
df.columns.str.endswith('_CAP')
creates a boolean mask whose values are True if and only if the column name ends with '_CAP'. As an alternative, use filter with the following regex:
df['CAP_SUM'] = df.filter(regex='_CAP$').sum(1)
print(df)
Output (of filter)
ID Length Width Range_CAP Capacity_CAP CAP_SUM
0 1 33 25 16 50 66
1 2 34 22 11 66 77
2 3 22 12 15 42 57
3 4 46 45 66 54 120
4 5 16 6 23 75 98
5 6 21 42 433 50 483
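For reference, the mask itself looks like this on the sample frame, before CAP_SUM is added (a minimal sketch):
print(df.columns.str.endswith('_CAP'))
# [False False False  True  True]  -> ID, Length and Width are excluded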
You may try this:
columnstxt = df.columns
df['sum'] = 0
for i in columnstxt:
    if i.find('_CAP') != -1:
        df['sum'] = df['sum'] + df[i]
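The same idea in a more concise form; a sketch assuming any column containing '_CAP' should be included:
cap_cols = [c for c in df.columns if '_CAP' in c]
df['sum'] = df[cap_cols].sum(axis=1)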

Multiply dataframe rows depending on a value in the row

I have a dataframe like this:
df = pd.DataFrame({'col1': [69, 77, 88],
                   'col2': ['bar34', 'barf30', 'barfoo29'],
                   'col3': [4, 2, 5]})
print(df, '\n')
col1 col2 col3
0 69 bar34 4
1 77 barf30 2
2 88 barfoo29 5
I need to repeat each row according to the value in 'col3'. Desired output:
col1 col2 col3
0 69 bar34 4
1 69 bar34 4
2 69 bar34 4
3 69 bar34 4
4 77 barf30 2
5 77 barf30 2
6 88 barfoo29 5
7 88 barfoo29 5
8 88 barfoo29 5
9 88 barfoo29 5
10 88 barfoo29 5
Here is a solution: Index.repeat, then DataFrame.reindex:
df = df.set_index(df.col3)
print(
    df.reindex(df.index.repeat(df.col3))
      .reset_index(drop=True)
)
# shorter version suggested by @anky:
df.loc[df.index.repeat(df.col3)]
col1 col2 col3
0 69 bar34 4
1 69 bar34 4
2 69 bar34 4
3 69 bar34 4
4 77 barf30 2
5 77 barf30 2
6 88 barfoo29 5
7 88 barfoo29 5
8 88 barfoo29 5
9 88 barfoo29 5
10 88 barfoo29 5
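Starting from the original df (before any set_index), the @anky one-liner plus a reset produces the same output:
out = df.loc[df.index.repeat(df.col3)].reset_index(drop=True)
print(out)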
I have only one solution, but I'm pretty sure it's not efficient at all.
import numpy as np

# Get the column names
cols = df.columns.to_list()
# Collect each row repeated col3 times
full_array = []
for index, row in df.iterrows():
    for _ in range(row['col3']):
        full_array.append([row[col_name] for col_name in cols])
# Build a new dataframe from the repeated rows
df = pd.DataFrame(np.array(full_array), columns=cols)
print(df)
col1 col2 col3
0 69 bar34 4
1 69 bar34 4
2 69 bar34 4
3 69 bar34 4
4 77 barf30 2
5 77 barf30 2
6 88 barfoo29 5
7 88 barfoo29 5
8 88 barfoo29 5
9 88 barfoo29 5
10 88 barfoo29 5

How to generate a pandas dataframe from a list with a condition

I have the following list in Python:
movie_list = [11, 21, 31, 41, 51, 62, 55]
and the following movie dataframe:
userId movieId
1 11
1 21
1 31
2 62
2 55
Now, for each userId, I want to generate a similar dataframe containing the movieIds that are in movie_list but not already in the dataframe for that user.
My desired dataframe would be
userId movieId
1 41
1 51
1 62
1 55
2 11
2 21
2 31
2 41
2 51
How can I do it in pandas?
IIUC, we can agg with list, then take the difference between movie_list and the original values in df per user:
s = df.groupby('userId').movieId.agg(list).\
      map(lambda x: list(set(movie_list) - set(x))).explode().reset_index()
print(s)
userId movieId
0 1 41
1 1 51
2 1 62
3 1 55
4 2 41
5 2 11
6 2 51
7 2 21
8 2 31
One approach would be to use itertools.product to create all combinations of userId & movieId, then concat and drop_duplicates with keep=False: every existing pair appears in both frames and is therefore dropped, leaving only the missing pairs.
from itertools import product
movie_list = [11, 21, 31, 41, 51, 62, 55]
df_all = pd.DataFrame(product(df['userId'].unique(), movie_list), columns=df.columns)
df2 = pd.concat([df, df_all]).drop_duplicates(keep=False)
print(df2)
[out]
userId movieId
3 1 41
4 1 51
5 1 62
6 1 55
7 2 11
8 2 21
9 2 31
10 2 41
11 2 51
prod = pd.MultiIndex.from_product([df.userId.unique().tolist(), movie_list]).tolist()
(
    pd.DataFrame(set(prod).difference([tuple(e) for e in df.values]),
                 columns=['userId', 'movieId'])
    .sort_values(by=['userId', 'movieId'])
)
userId movieId
7 1 41
6 1 51
2 1 55
8 1 62
5 2 11
4 2 21
3 2 31
1 2 41
0 2 51
I think you need:
df = df.groupby("userId")["movieId"].apply(list).reset_index()
df["movieId"] = df["movieId"].apply(lambda x: list(set(movie_list)-set(x)))
df = df.explode("movieId")
print(df)
Output:
userId movieId
0 1 41
0 1 51
0 1 62
0 1 55
1 2 41
1 2 11
1 2 51
1 2 21
1 2 31
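Another option, not shown above, is an anti-join via merge with indicator=True; a sketch using the same df and movie_list:
from itertools import product

# All (userId, movieId) combinations
all_pairs = pd.DataFrame(product(df['userId'].unique(), movie_list),
                         columns=['userId', 'movieId'])
# Left-merge against the existing pairs and keep only the ones not found
merged = all_pairs.merge(df, on=['userId', 'movieId'], how='left', indicator=True)
result = merged.loc[merged['_merge'] == 'left_only', ['userId', 'movieId']]
print(result)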

Insert a dataframe into a dataframe - Python/Pandas

The question is pretty self-explanatory: how would you insert a dataframe with a couple of values into a bigger dataframe at a given point (between indexes 10 and 11)? This means .append can't be used.
You can use concat with the dataframe sliced by loc:
import numpy as np

np.random.seed(100)
df1 = pd.DataFrame(np.random.randint(100, size=(5,6)), columns=list('ABCDEF'))
print (df1)
A B C D E F
0 8 24 67 87 79 48
1 10 94 52 98 53 66
2 98 14 34 24 15 60
3 58 16 9 93 86 2
4 27 4 31 1 13 83
df2 = pd.DataFrame({'A':[1,2,3],
                    'B':[4,5,6],
                    'C':[7,8,9],
                    'D':[1,3,5],
                    'E':[5,3,6],
                    'F':[7,4,3]})
print (df2)
A B C D E F
0 1 4 7 1 5 7
1 2 5 8 3 3 4
2 3 6 9 5 6 3
#inserted between index values 4 and 5; note that loc slicing is inclusive at
#both ends, so row 4 appears on both sides of the insert here (use df1.loc[:3]
#and df1.loc[4:] to insert between rows 3 and 4 without the duplicate)
print (pd.concat([df1.loc[:4], df2, df1.loc[4:]], ignore_index=True))
A B C D E F
0 8 24 67 87 79 48
1 10 94 52 98 53 66
2 98 14 34 24 15 60
3 58 16 9 93 86 2
4 27 4 31 1 13 83
5 1 4 7 1 5 7
6 2 5 8 3 3 4
7 3 6 9 5 6 3
8 27 4 31 1 13 83
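The pattern generalizes to a small helper; a sketch with a hypothetical name insert_frame, inserting df_new after positional row pos (iloc slicing is exclusive at the stop, so nothing is duplicated):
def insert_frame(df, df_new, pos):
    # Split by position rather than label
    top = df.iloc[:pos + 1]
    bottom = df.iloc[pos + 1:]
    return pd.concat([top, df_new, bottom], ignore_index=True)

# e.g. insert df2 between rows 10 and 11 of a bigger frame:
# result = insert_frame(big_df, df2, 10)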

Pandas: compute numerous columns of percentage values

I'm failing to loop through the values of select dataframe columns in order to create new columns representing percentage values. Reproducible example:
data = {'Respondents': [90, 43, 89, '89', '67', '88', '73', '78', '62', '101'],
        'answer_1': [51, 15, 15, 61, 16, 14, 15, 1, 0, 16],
        'answer_2': [11, 12, 14, 40, 36, 78, 12, 0, 26, 78],
        'answer_3': [3, 8, 4, 0, 2, 7, 10, 11, 6, 7]}
df = pd.DataFrame(data)
df
Respondents answer_1 answer_2 answer_3
0 90 51 11 3
1 43 15 12 8
2 89 15 14 4
3 89 61 40 0
4 67 16 36 2
5 88 14 78 7
6 73 15 12 10
7 78 1 0 11
8 62 0 26 6
9 101 16 78 7
The aim is to compute the percentage for each of the answer columns against the total respondents. For example, for the new answer_1 column – let's name it answer_1_perc – the first value would be 57 (because 51 is about 57% of 90), the next value would be 35 (15 is about 35% of 43). Then there would be answer_2_perc and answer_3_perc columns.
I've written so many iterations of the following code my head's spinning.
for columns in df.iloc[:, 1:4]:
    for i in columns:
        i_name = 'percentage_' + str(columns)
        i_group = ([i] / df['Respondents'] * 100)
        df[i_name] = i_group
What is the best way to do this? I need to use an iterative method as my actual data has 25 answer columns rather than the 3 shown in this example.
You almost had it; note that you have string values in the Respondents col, which I've corrected prior to calling the following:
In [172]:
df['Respondents'] = df['Respondents'].astype(int)  # correct the string values first
for col in df.columns[1:4]:
    i_name = 'percentage_' + col
    i_group = (df[col] / df['Respondents']) * 100
    df[i_name] = i_group
df
df
Out[172]:
Respondents answer_1 answer_2 answer_3 percentage_answer_1 \
0 90 51 11 3 56.666667
1 43 15 12 8 34.883721
2 89 15 14 4 16.853933
3 89 61 40 0 68.539326
4 67 16 36 2 23.880597
5 88 14 78 7 15.909091
6 73 15 12 10 20.547945
7 78 1 0 11 1.282051
8 62 0 26 6 0.000000
9 101 16 78 7 15.841584
percentage_answer_2 percentage_answer_3
0 12.222222 3.333333
1 27.906977 18.604651
2 15.730337 4.494382
3 44.943820 0.000000
4 53.731343 2.985075
5 88.636364 7.954545
6 16.438356 13.698630
7 0.000000 14.102564
8 41.935484 9.677419
9 77.227723 6.930693
I recommend using div and concat:
df['Respondents'] = df['Respondents'].astype(float)
df_pct = (df.drop('Respondents', axis=1)
            .div(df['Respondents'], axis=0)
            .mul(100)
            .rename(columns=lambda col: 'percentage_' + col))
pd.concat([df, df_pct], axis=1)
Respondents answer_1 answer_2 answer_3 percentage_answer_1 \
0 90.0 51 11 3 56.666667
1 43.0 15 12 8 34.883721
2 89.0 15 14 4 16.853933
3 89.0 61 40 0 68.539326
4 67.0 16 36 2 23.880597
5 88.0 14 78 7 15.909091
6 73.0 15 12 10 20.547945
7 78.0 1 0 11 1.282051
8 62.0 0 26 6 0.000000
9 101.0 16 78 7 15.841584
percentage_answer_2 percentage_answer_3
0 12.222222 3.333333
1 27.906977 18.604651
2 15.730337 4.494382
3 44.943820 0.000000
4 53.731343 2.985075
5 88.636364 7.954545
6 16.438356 13.698630
7 0.000000 14.102564
8 41.935484 9.677419
9 77.227723 6.930693
Another solution: div the desired columns by the Respondents column and assign the result to new column names (.ix is removed in modern pandas, so use .iloc):
print ('percentage_' + df.columns[1:4])
Index(['percentage_answer_1', 'percentage_answer_2', 'percentage_answer_3'], dtype='object')
df['percentage_' + df.columns[1:4]] = df.iloc[:, 1:4].div(df.Respondents, axis=0) * 100
print (df)
Respondents answer_1 answer_2 answer_3 percentage_answer_1 \
0 90 51 11 3 56.666667
1 43 15 12 8 34.883721
2 89 15 14 4 16.853933
3 89 61 40 0 68.539326
4 67 16 36 2 23.880597
5 88 14 78 7 15.909091
6 73 15 12 10 20.547945
7 78 1 0 11 1.282051
8 62 0 26 6 0.000000
9 101 16 78 7 15.841584
percentage_answer_2 percentage_answer_3
0 12.222222 3.333333
1 27.906977 18.604651
2 15.730337 4.494382
3 44.943820 0.000000
4 53.731343 2.985075
5 88.636364 7.954545
6 16.438356 13.698630
7 0.000000 14.102564
8 41.935484 9.677419
9 77.227723 6.930693
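Since the real data has 25 answer columns, selecting them by name prefix avoids hard-coded positions; a minimal sketch, assuming the columns all start with 'answer_':
answer_cols = [c for c in df.columns if c.startswith('answer_')]
df['Respondents'] = pd.to_numeric(df['Respondents'])  # guard against string values
for c in answer_cols:
    df['percentage_' + c] = df[c] / df['Respondents'] * 100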
