Repeat dataframe rows depending on a value in each row - python

I have a dataframe like this:
df = pd.DataFrame({'col1': [69, 77, 88],
'col2': ['bar34', 'barf30', 'barfoo29'],
'col3': [4, 2, 5]})
print(df, '\n')
col1 col2 col3
0 69 bar34 4
1 77 barf30 2
2 88 barfoo29 5
I need to repeat each row as many times as the value in 'col3'. Desired output:
col1 col2 col3
0 69 bar34 4
1 69 bar34 4
2 69 bar34 4
3 69 bar34 4
4 77 barf30 2
5 77 barf30 2
6 88 barfoo29 5
7 88 barfoo29 5
8 88 barfoo29 5
9 88 barfoo29 5
10 88 barfoo29 5

Here is a solution: use Index.repeat, then DataFrame.reindex.
# The default index is unique, so it can be repeated and reindexed directly
print(
    df.reindex(df.index.repeat(df.col3))
      .reset_index(drop=True)
)
# Alternative suggested by @anky:
df.loc[df.index.repeat(df.col3)].reset_index(drop=True)
col1 col2 col3
0 69 bar34 4
1 69 bar34 4
2 69 bar34 4
3 69 bar34 4
4 77 barf30 2
5 77 barf30 2
6 88 barfoo29 5
7 88 barfoo29 5
8 88 barfoo29 5
9 88 barfoo29 5
10 88 barfoo29 5
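A self-contained sketch of the Index.repeat approach, for checking it against the desired output. It rebuilds the example frame from the question; the variable name repeated is only illustrative.

```python
import pandas as pd

df = pd.DataFrame({'col1': [69, 77, 88],
                   'col2': ['bar34', 'barf30', 'barfoo29'],
                   'col3': [4, 2, 5]})

# Repeat each index label as many times as that row's col3 value,
# select the rows by those labels, then renumber the index from 0
repeated = df.loc[df.index.repeat(df['col3'])].reset_index(drop=True)
print(repeated)
```

Because the labels come from the frame's own index, this should also work when col3 contains duplicate counts, provided the index itself is unique.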

I have only one solution, but I'm pretty sure it's not efficient at all.
import pandas as pd
# Get columns list
cols = df.columns.to_list()
# Collect one mini dataframe per original row
pieces = []
# Loop for each row to repeat
for index, row in df.iterrows():
    # Loop for each new row we get
    full_array = []
    for new_row in range(row['col3']):
        row_lst = [row[col_name] for col_name in cols]
        full_array.append(row_lst)
    # Creating mini_df
    mini_df = pd.DataFrame(full_array, columns=cols)
    pieces.append(mini_df)
# Concat all mini dataframes once, after the loop, so the row labels
# being iterated over are never renumbered mid-loop
df = pd.concat(pieces, ignore_index=True)
print(df)
col1 col2 col3
0 69 bar34 4
1 69 bar34 4
2 69 bar34 4
3 69 bar34 4
4 77 barf30 2
5 77 barf30 2
6 88 barfoo29 5
7 88 barfoo29 5
8 88 barfoo29 5
9 88 barfoo29 5
10 88 barfoo29 5

Related

How to sum up a column based on another columns value Python

I have this example df
col1 = [1,1,1,2,2,1,1,1,2,2,2]
col2 = [20, 23, 12, 44, 14, 42, 44, 1, 42, 62, 11]
data = {"col1": col1, "col2": col2}
df = pd.DataFrame(data)
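I need to add a column that holds the sum of col2 over each consecutive run of equal col1 values: one total for the first run of 1s, one for the following run of 2s, and so on. I have tried grouping by col1, but that merges all the 1s (and all the 2s) into a single group instead of treating each run separately.
The expected output would be this: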
col1 col2 col3
1 20 55
1 23 55
1 12 55
2 44 58
2 14 58
1 42 87
1 44 87
1 1 87
2 42 115
2 62 115
2 11 115
Please let me know how to approach this.
Use GroupBy.transform with a helper Series that labels each run of consecutive values. Build it by comparing each value with the previous (shifted) one for inequality and taking the cumulative sum:
df['col3'] = df.groupby(df['col1'].ne(df['col1'].shift()).cumsum())['col2'].transform('sum')
print (df)
col1 col2 col3
0 1 20 55
1 1 23 55
2 1 12 55
3 2 44 58
4 2 14 58
5 1 42 87
6 1 44 87
7 1 1 87
8 2 42 115
9 2 62 115
10 2 11 115
You can do this by creating a column that marks every change in col1 and then summing within each marked group using groupby:
i = df.col1
df['Var3'] = i.ne(i.shift()).cumsum()
df['sums'] = df.groupby(['Var3'])['col2'].transform('sum')
which gives
col1 col2 Var3 sums
0 1 20 1 55
1 1 23 1 55
2 1 12 1 55
3 2 44 2 58
4 2 14 2 58
5 1 42 3 87
6 1 44 3 87
7 1 1 3 87
8 2 42 4 115
9 2 62 4 115
10 2 11 4 115
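The run-labelling step can be wrapped in a small helper. This is a sketch only; the function name run_ids is hypothetical and not part of pandas.

```python
import pandas as pd

def run_ids(s):
    """Label consecutive runs of equal values with 1, 2, 3, ... (hypothetical helper)."""
    # s.ne(s.shift()) is True wherever a value differs from the one before it;
    # the cumulative sum of those flags increases by 1 at the start of each run
    return s.ne(s.shift()).cumsum()

df = pd.DataFrame({'col1': [1, 1, 1, 2, 2, 1, 1, 1, 2, 2, 2],
                   'col2': [20, 23, 12, 44, 14, 42, 44, 1, 42, 62, 11]})

ids = run_ids(df['col1'])
# Sum col2 within each run and broadcast the total back to every row of the run
df['col3'] = df.groupby(ids)['col2'].transform('sum')
print(ids.tolist())  # [1, 1, 1, 2, 2, 3, 3, 3, 4, 4, 4]
print(df)
```

Keeping the run labels in a separate Series avoids adding a helper column such as Var3 to the frame.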

Creating a new Dataframe based on rows with certain values and removing the rows from the original Dataframe

I am trying to split a DataFrame based on whether a row contains a certain value in any of several columns, so that one DataFrame holds all rows containing the value and the other holds the remaining rows.
df = pd.DataFrame(np.random.randint(-1,100,size=(100, 4)), columns=list('ABCD'))
df
A B C D
0 51 86 15 80
1 61 53 75 66
2 80 48 23 58
3 86 25 37 99
4 50 11 87 71
... ... ... ... ...
95 34 40 43 40
96 89 16 83 72
97 97 32 24 26
98 27 83 75 29
99 24 50 40 43
100 rows × 4 columns
df[~df.isin([-1])].dropna()
A B C D
0 51 86 15 80.0
1 61 53 75 66.0
2 80 48 23 58.0
3 86 25 37 99.0
4 50 11 87 71.0
... ... ... ... ...
95 34 40 43 40.0
96 89 16 83 72.0
97 97 32 24 26.0
98 27 83 75 29.0
99 24 50 40 43.0
98 rows × 4 columns
df[df.isin([-1])].dropna()
A B C D
This is what I tried so far, and the first part worked correctly. However, df[df.isin([-1])].dropna() returned an empty DataFrame.
Let's assume you would like to filter the data by a value equal to 80.
A possible solution is the following:
# pip install pandas
import pandas as pd
import numpy as np
df = pd.DataFrame(np.random.randint(-1,100,size=(100, 4)), columns=list('ABCD'))
df
# df1 = df[~df.isin([80])].dropna().reset_index(drop=True)
# or
df1 = df[~df.eq(80).any(axis=1)].reset_index(drop=True)
df1
df2 = df[df.eq(80).any(axis=1)].reset_index(drop=True)
df2
Your code is almost correct. Use any(axis=1) to reduce the mask to one boolean value per row instead of using dropna(how='all').
The same with a reproducible example:
import pandas as pd
import numpy as np
np.random.seed(2022)
vals = np.random.choice([-1, 0, 1], size=(10, 4), p=[.2, .4, .4])
df = pd.DataFrame(vals, columns=list('ABCD'))
m = df.isin([-1]).any(axis=1) # or df.eq(-1).any(axis=1)
df1, df2 = df[m], df[~m]
Output:
>>> df.assign(M=m)
A B C D M
0 -1 0 -1 -1 True
1 1 0 1 1 False
2 1 1 1 1 False
3 1 1 0 0 False
4 0 1 1 -1 True
5 1 0 0 1 False
6 -1 0 1 0 True
7 0 0 0 0 False
8 1 -1 1 0 True
9 1 1 0 1 False
>>> df1
A B C D
0 -1 0 -1 -1
4 0 1 1 -1
6 -1 0 1 0
8 1 -1 1 0
>>> df2
A B C D
1 1 0 1 1
2 1 1 1 1
3 1 1 0 0
5 1 0 0 1
7 0 0 0 0
9 1 1 0 1
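To see why the attempt in the question came back empty, here is a small sketch on a made-up three-row frame (the data and variable names are illustrative, not from the question). Indexing with a boolean frame turns every non-matching cell into NaN, and dropna() then discards any row that contains a NaN.

```python
import pandas as pd

df = pd.DataFrame({'A': [-1, 1, 0], 'B': [0, 1, -1], 'C': [-1, 2, 3]})

# Cells that are not -1 become NaN
masked = df[df.isin([-1])]
# dropna() drops rows with any NaN, so only rows made up entirely of -1 would survive
kept = masked.dropna()
print(kept.empty)

# A row-level mask keeps whole rows intact instead
row_mask = df.isin([-1]).any(axis=1)
with_value, without_value = df[row_mask], df[~row_mask]
print(with_value)
print(without_value)
```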

How can we apply pandas groupby to get the expected output?

Col1 Col2 Col3 Result
2 70 1 15
2 71 2 15
2 72 3 15
3 80 4 16
3 81 5 16
3 82 6 16
3 2 15 16
3 3 16 16
I am new to pandas. Can anyone explain how to compute the last column, Result, and add it to my existing data frame?
Use Series.map with DataFrame.drop_duplicates, which keeps one row per unique Col2 value, to look up each Col1 value in Col2 and return the matching Col3:
df['Res'] = df['Col1'].map(df.drop_duplicates('Col2').set_index('Col2')['Col3'])
print (df)
Col1 Col2 Col3 Result Res
0 2 70 1 15 15
1 2 71 2 15 15
2 2 72 3 15 15
3 3 80 4 16 16
4 3 81 5 16 16
5 3 82 6 16 16
6 3 2 15 16 16
7 3 3 16 16 16
Another option is merge:
df.merge(df[['Col2','Col3']].rename(columns={'Col2':'Col1', 'Col3':'Res'}),
on='Col1', how='left')
Output:
Col1 Col2 Col3 Result Res
0 2 70 1 15 15
1 2 71 2 15 15
2 2 72 3 15 15
3 3 80 4 16 16
4 3 81 5 16 16
5 3 82 6 16 16
6 3 2 15 16 16
7 3 3 16 16 16
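A runnable sketch of the map-based lookup, with the frame rebuilt from the table in the question (the Result column is left out so it can be recomputed). The intermediate name lookup is only illustrative.

```python
import pandas as pd

df = pd.DataFrame({'Col1': [2, 2, 2, 3, 3, 3, 3, 3],
                   'Col2': [70, 71, 72, 80, 81, 82, 2, 3],
                   'Col3': [1, 2, 3, 4, 5, 6, 15, 16]})

# Series indexed by Col2 whose values are Col3: a Col2 -> Col3 lookup table
lookup = df.drop_duplicates('Col2').set_index('Col2')['Col3']

# For each Col1 value, find the row whose Col2 equals it and take that row's Col3
df['Res'] = df['Col1'].map(lookup)
print(df)
```

Col1 values with no matching Col2 would come back as NaN, so it may be worth checking for missing values on real data.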

insert dataframe into a dataframe - Python/Pandas

The question is pretty self-explanatory: how would you insert a dataframe with a couple of values into a bigger dataframe at a given point (between indexes 10 and 11)? This means that .append can't be used.
You can use concat with sliced df by loc:
np.random.seed(100)
df1 = pd.DataFrame(np.random.randint(100, size=(5,6)), columns=list('ABCDEF'))
print (df1)
A B C D E F
0 8 24 67 87 79 48
1 10 94 52 98 53 66
2 98 14 34 24 15 60
3 58 16 9 93 86 2
4 27 4 31 1 13 83
df2 = pd.DataFrame({'A':[1,2,3],
'B':[4,5,6],
'C':[7,8,9],
'D':[1,3,5],
'E':[5,3,6],
'F':[7,4,3]})
print (df2)
A B C D E F
0 1 4 7 1 5 7
1 2 5 8 3 3 4
2 3 6 9 5 6 3
#inserted after index value 4; loc slices include both endpoints, so the second slice starts at 5
#(df1 has no rows after index 4, so nothing follows df2 here)
print (pd.concat([df1.loc[:4], df2, df1.loc[5:]], ignore_index=True))
A B C D E F
0 8 24 67 87 79 48
1 10 94 52 98 53 66
2 98 14 34 24 15 60
3 58 16 9 93 86 2
4 27 4 31 1 13 83
5 1 4 7 1 5 7
6 2 5 8 3 3 4
7 3 6 9 5 6 3
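A minimal sketch of the same concat idea using positional iloc slices. Unlike loc, iloc slices exclude the end position, so the two halves never overlap. The helper name insert_rows and the small frames are hypothetical, chosen only for illustration.

```python
import pandas as pd

def insert_rows(df, other, pos):
    """Return a new frame with `other` placed before positional row `pos` of `df`."""
    # iloc[:pos] holds rows 0..pos-1 and iloc[pos:] holds the rest
    return pd.concat([df.iloc[:pos], other, df.iloc[pos:]], ignore_index=True)

big = pd.DataFrame({'A': range(5), 'B': range(10, 15)})
small = pd.DataFrame({'A': [100, 200], 'B': [300, 400]})

result = insert_rows(big, small, 3)
print(result)
```

For the question's case (between indexes 10 and 11 on a default integer index), the call would be insert_rows(df, other, 11).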

Pandas: compute numerous columns of percentage values

I'm failing to loop through the values of select dataframe columns in order to create new columns representing percentage values. Reproducible example:
data = {'Respondents': [90, 43, 89, '89', '67', '88', '73', '78', '62', '101'],
'answer_1': [51, 15, 15, 61, 16, 14, 15, 1, 0, 16],
'answer_2': [11, 12, 14, 40, 36, 78, 12, 0, 26, 78],
'answer_3': [3, 8, 4, 0, 2, 7, 10, 11, 6, 7]}
df = pd.DataFrame(data)
df
Respondents answer_1 answer_2 answer_3
0 90 51 11 3
1 43 15 12 8
2 89 15 14 4
3 89 61 40 0
4 67 16 36 2
5 88 14 78 7
6 73 15 12 10
7 78 1 0 11
8 62 0 26 6
9 101 16 78 7
The aim is to compute the percentage for each of the answer columns against the total respondents. For example, for the new answer_1 column (let's name it answer_1_perc) the first value would be 57 (because 51 is about 57% of 90), and the next value would be 35 (15 is about 35% of 43). Then there would be answer_2_perc and answer_3_perc columns.
I've written so many iterations of the following code my head's spinning.
for columns in df.iloc[:, 1:4]:
    for i in columns:
        i_name = 'percentage_' + str(columns)
        i_group = ([i] / df['Respondents'] * 100)
        df[i_name] = i_group
What is the best way to do this? I need to use an iterative method as my actual data has 25 answer columns rather than the 3 shown in this example.
You almost had it. Note that you have string values in the Respondents column, which I corrected before running the following:
In [172]:
for col in df.columns[1:4]:
    i_name = 'percentage_' + col
    i_group = (df[col] / df['Respondents']) * 100
    df[i_name] = i_group
df
Out[172]:
Respondents answer_1 answer_2 answer_3 percentage_answer_1 \
0 90 51 11 3 56.666667
1 43 15 12 8 34.883721
2 89 15 14 4 16.853933
3 89 61 40 0 68.539326
4 67 16 36 2 23.880597
5 88 14 78 7 15.909091
6 73 15 12 10 20.547945
7 78 1 0 11 1.282051
8 62 0 26 6 0.000000
9 101 16 78 7 15.841584
percentage_answer_2 percentage_answer_3
0 12.222222 3.333333
1 27.906977 18.604651
2 15.730337 4.494382
3 44.943820 0.000000
4 53.731343 2.985075
5 88.636364 7.954545
6 16.438356 13.698630
7 0.000000 14.102564
8 41.935484 9.677419
9 77.227723 6.930693
I recommend using div and concat:
df['Respondents'] = df['Respondents'].astype(float)
df_pct = (df.drop('Respondents', axis=1)
.div(df['Respondents'], axis=0)
.mul(100)
.rename(columns=lambda col: 'percentage_' + col)
)
pd.concat([df, df_pct], axis=1)
Respondents answer_1 answer_2 answer_3 percentage_answer_1 \
0 90.0 51 11 3 56.666667
1 43.0 15 12 8 34.883721
2 89.0 15 14 4 16.853933
3 89.0 61 40 0 68.539326
4 67.0 16 36 2 23.880597
5 88.0 14 78 7 15.909091
6 73.0 15 12 10 20.547945
7 78.0 1 0 11 1.282051
8 62.0 0 26 6 0.000000
9 101.0 16 78 7 15.841584
percentage_answer_2 percentage_answer_3
0 12.222222 3.333333
1 27.906977 18.604651
2 15.730337 4.494382
3 44.943820 0.000000
4 53.731343 2.985075
5 88.636364 7.954545
6 16.438356 13.698630
7 0.000000 14.102564
8 41.935484 9.677419
9 77.227723 6.930693
Another solution: divide the desired columns by the Respondents column with div, then assign the result to the new column names:
print ('percentage_' + df.columns[1:4])
Index(['percentage_answer_1', 'percentage_answer_2', 'percentage_answer_3'], dtype='object')
# .ix was removed in pandas 1.0; iloc selects the same columns by position
df['percentage_' + df.columns[1:4]] = df.iloc[:,1:4].div(df.Respondents, axis=0) * 100
print (df)
Respondents answer_1 answer_2 answer_3 percentage_answer_1 \
0 90 51 11 3 56.666667
1 43 15 12 8 34.883721
2 89 15 14 4 16.853933
3 89 61 40 0 68.539326
4 67 16 36 2 23.880597
5 88 14 78 7 15.909091
6 73 15 12 10 20.547945
7 78 1 0 11 1.282051
8 62 0 26 6 0.000000
9 101 16 78 7 15.841584
percentage_answer_2 percentage_answer_3
0 12.222222 3.333333
1 27.906977 18.604651
2 15.730337 4.494382
3 44.943820 0.000000
4 53.731343 2.985075
5 88.636364 7.954545
6 16.438356 13.698630
7 0.000000 14.102564
8 41.935484 9.677419
9 77.227723 6.930693
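For the 25-column case mentioned in the question, a sketch that selects the answer columns by name prefix instead of by position, and converts the string values in Respondents first. The _perc suffix follows the naming in the question; rounding to whole numbers is an assumption based on the 35 example there.

```python
import pandas as pd

data = {'Respondents': [90, 43, 89, '89', '67', '88', '73', '78', '62', '101'],
        'answer_1': [51, 15, 15, 61, 16, 14, 15, 1, 0, 16],
        'answer_2': [11, 12, 14, 40, 36, 78, 12, 0, 26, 78],
        'answer_3': [3, 8, 4, 0, 2, 7, 10, 11, 6, 7]}
df = pd.DataFrame(data)

# Respondents mixes ints and numeric strings, so convert it first
df['Respondents'] = pd.to_numeric(df['Respondents'])

# Pick up every answer column by its prefix, however many there are
answer_cols = [c for c in df.columns if c.startswith('answer_')]

pct = (df[answer_cols]
       .div(df['Respondents'], axis=0)  # row-wise division by Respondents
       .mul(100)
       .round()
       .astype(int)
       .add_suffix('_perc'))
df = df.join(pct)
print(df[['answer_1_perc', 'answer_2_perc', 'answer_3_perc']].head(2))
```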
