How to simplify pandas columns sum? - python

I'm trying to sum columns, like the following:
The data frame:
ID  name  grade_math  grade_chemistry  grade_physic  CS_math  CS_chemistry  CS_physic
1   A     4           2.75             3             3        2             3
2   B     3           4                4             3        2             3
3   C     2           2                2             3        2             3
the formula is:
df['total'] = (df['grade_math']*df['CS_math']) + (df['grade_chemistry']*df['CS_chemistry']) + (df['grade_physic']*df['CS_physic'])
but I've tried to simplify like this:
df['total'] = sum(df[f'grade{i}'] * df[f'CS{i}'] for i in range(1, 3))
but I realized this logic is totally wrong. Any suggestions?

You were close in your logic. What you're after is this:
sum(df[f'grade_{subject}'] * df[f'CS_{subject}'] for subject in ["math", "chemistry", "physic"])
The issue is that with for i in range(1, 3) you were iterating over numbers. Placing them into the f-strings therefore produces names like grade1, CS1, grade2, CS2, and those strings don't exist among the columns of your dataframe.
Therefore, in the provided solution we iterate over the common suffixes ("math", "chemistry", and "physic") so that the resulting f-strings match the dataframe's column names.
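To see the mismatch, assuming the dataframe above is df, you can print the names the original attempt generates next to the real columns:
print([f'grade{i}' for i in range(1, 3)])
# ['grade1', 'grade2']  <- these columns do not exist
print([c for c in df.columns if c.startswith('grade')])
# ['grade_math', 'grade_chemistry', 'grade_physic']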

Use:
sum(df[f'grade_{i}'] * df[f'CS_{i}'] for i in ['math', 'chemistry', 'physic'])
Output:
0 26.5
1 29.0
2 16.0
dtype: float64
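If you'd rather not spell out the subject list at all, here is a rough sketch of a more general approach (it assumes every grade_* column has a matching CS_* column, and str.removeprefix needs Python 3.9+): strip the prefixes so pandas aligns the two blocks by subject, multiply, and sum across each row.
import pandas as pd

df = pd.DataFrame({
    'ID': [1, 2, 3],
    'name': ['A', 'B', 'C'],
    'grade_math': [4, 3, 2],
    'grade_chemistry': [2.75, 4, 2],
    'grade_physic': [3, 4, 2],
    'CS_math': [3, 3, 3],
    'CS_chemistry': [2, 2, 2],
    'CS_physic': [3, 3, 3],
})

# Rename each block to the bare subject name so the columns line up.
grades = df.filter(like='grade_').rename(columns=lambda c: c.removeprefix('grade_'))
credits = df.filter(like='CS_').rename(columns=lambda c: c.removeprefix('CS_'))

df['total'] = (grades * credits).sum(axis=1)
print(df['total'])
# 0    26.5
# 1    29.0
# 2    16.0
# Name: total, dtype: float64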

pandas: replace a string with the value from another column

I'm having an issue when I try to replace a string with a value from another column.
I want to replace 'Length' with df['Length'].
df["Length"]= df["Length"].replace('Length', df['Length'], regex = True)
Below is my data
Input:
Formula     Length
Length      5
Length+1.5  6
Length-2.5  5
Length      4
5           5
Expected Output:
Formula  Length
5        5
6+1.5    6
5-2.5    5
4        4
5        5
However, with the code I used above, it replaces my entire cell instead of just 'Length'. I found this is because df['Length'] is used as the replacement; if I use a plain string instead, the trailing offset (e.g. +1.5) does not get wiped out. I'm getting the output below:
Formula  Length
5        5
6        6
5        5
4        4
5        5
Is there a replace method that can use values from another column?
Thank you.
If you want to replace using values from another column, it is necessary to use DataFrame.apply:
df["Formula"]= df.apply(lambda x: x['Formula'].replace('Length', str(x['Length'])), axis=1)
print (df)
  Formula  Length
0       5       5
1   6+1.5       6
2   5-2.5       5
3       4       4
4       5       5
Or list comprehension:
df["Formula"]= [x.replace('Length', str(y)) for x, y in df[['Formula','Length']].to_numpy()]
Just wanted to add that the list comprehension is of course much faster:
df = pd.DataFrame({'a': ['aba'] * 1000000, 'c': ['c'] * 1000000})
%timeit df.apply(lambda x: x['a'].replace('b', x['c']), axis=1)
# 1 loop, best of 5: 11.8 s per loop
%timeit [x.replace('b', str(y)) for x, y in df[['a', 'c']].to_numpy()]
# 1 loop, best of 5: 1.3 s per loop
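As a small variation on the list comprehension (a sketch under the same column names), you can also zip the two columns directly instead of going through to_numpy():
df["Formula"] = [f.replace('Length', str(l)) for f, l in zip(df['Formula'], df['Length'])]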

python remove duplicate substring parsed by comma

I have an input Pandas Series like this:
I would like to remove duplicates in each row. For example, change M,S,S to M,S.
I tried
fifa22['player_positions'] = fifa22['player_positions'].str.split(',').apply(pd.unique)
But the result is a Series of ndarrays.
I would like to convert the results to a simple string, without the square brackets. Wondering what to do, thanks!
If it is only this one column, you should use map.
import pandas as pd

df = pd.DataFrame({
    'player_positions': "M,S,S S S,M M,M M,M M M,S S,M,M,S".split(' ')
})
print(df)
player_positions
0 M,S,S
1 S
2 S,M
3 M,M
4 M,M
5 M
6 M,S
7 S,M,M,S
out = df['player_positions'].map(lambda x: ','.join(set(x.split(','))))
print(out)
0 M,S
1 S
2 M,S
3 M
4 M
5 M
6 M,S
7 M,S
If you want to join with a different separator, just change the ',' in ','.join(...) to anything else.
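One caveat (not covered in the answer above): set does not guarantee the original order of the positions. If the first-seen order matters, a sketch using dict.fromkeys instead of set keeps it:
out = df['player_positions'].map(lambda x: ','.join(dict.fromkeys(x.split(','))))
print(out[0])   # 'M,S' - duplicates removed, original order preserved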

I need to create a dataframe where values reference previous rows

I am just starting to use Python and I'm trying to learn some of the general things about it. As I was playing around with it, I wanted to see if I could make a dataframe that shows a starting number compounded by a return. Sorry if this description doesn't make much sense, but I basically want a dataframe x rows long that shows me:
number*(return)^(row number) in each row
So, for example, say the number is 10 and the return is 10%; I would like the dataframe to give me the series
1 11
2 12.1
3 13.3
4 14.6
5 ...
6 ...
Thanks so much in advance!
Let us try
import numpy as np
import pandas as pd

val = 10
det = 0.1
n = 4
out = val * ((1 + det) ** np.arange(n))
s = pd.Series(out)
s
Out[426]:
0 10.00
1 11.00
2 12.10
3 13.31
dtype: float64
Notice that here I am using the index from 0, since 1.1**0 yields the original value.
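If you want the first row to already include one period of growth, as in the example in the question (11, 12.1, ...), a small tweak of the same sketch is to start the exponents at 1 and label the index accordingly:
out = val * ((1 + det) ** np.arange(1, n + 1))
s = pd.Series(out, index=range(1, n + 1))
# 1    11.000
# 2    12.100
# 3    13.310
# 4    14.641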
I think this does what you want:
df = pd.DataFrame({'returns': [x for x in range(1, 10)]})
df.index = df.index + 1
df.returns = df.returns.apply(lambda x: (10 * (1.1**x)))
print(df)
Out:
returns
1 11.000000
2 12.100000
3 13.310000
4 14.641000
5 16.105100
6 17.715610
7 19.487171
8 21.435888
9 23.579477
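Since the title asks about values that reference previous rows, the same series can also be written as a cumulative product, where each row is the previous row times (1 + rate). This is just a sketch; start, rate and periods are illustrative names:
import numpy as np
import pandas as pd

start, rate, periods = 10, 0.10, 9
s = pd.Series(np.full(periods, 1 + rate)).cumprod() * start
s.index = s.index + 1   # label the rows 1..periods, as in the question
print(s)   # 11.0, 12.1, 13.31, ..., 23.579477, matching the table above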

Get all row values from single column

This code is working and retrieving the exact data I need. I just need to output all the row values as a single value.
Here is the bit of code:
dfs = pd.read_html(str(tables))
df = dfs[0].iloc[[1,3,5,7,9,11,13,15,17,19],[1]]
s = df['score'].str.split('-',expand=True).astype(int)
df['team_win'] = np.where(s[0] == s[1], 0,s.idxmax(1) + 1)
df = df['team_win'].drop([9, 11])
Here is the output:
1     1
3     0
5     2
7     0
13    1
15    2
17    2
19    2
Name: team_win, dtype: int64
I need the team_win values to be output like this: 10201222, so I can copy them into a Google sheet.
Ranked from best performance to worst performance, you could:
1)
''.join(map(str, df['team_win']))
Out[319]: '10201222'
2)
''.join(df['team_win'].map(str))
Out[320]: '10201222'
3)
''.join([str(i) for i in df['team_win']])
Out[313]: '10201222'
Thanks to @piRSquared for the additional suggestions.
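For completeness, pandas also has a built-in concatenation that should give the same string, sketched here under the assumption that team_win holds integers:
df['team_win'].astype(str).str.cat()
# '10201222'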

simplifying routine in python with numpy array or pandas

The initial problem is the following: I have an initial matrix with, let's say, 10 rows and 12 columns. For every row, I want to sum adjacent columns in pairs, so at the end I still have 10 rows but only 6 columns. Currently, I am doing the following for loop in Python (using initial, which is a pandas DataFrame):
for i in range(0, 12, 2):
    coarse[i] = initial.iloc[:, i:i+2].sum(axis=1)
In fact, I am quite sure that something more efficient is possible. I am thinking of something like a list comprehension, but for a DataFrame or a numpy array. Does anybody have an idea?
Moreover, I would like to know whether it is better to manipulate large numpy arrays or pandas DataFrames.
Let's create a small sample dataframe to illustrate the solution:
import numpy as np
import pandas as pd

np.random.seed(0)
df = pd.DataFrame(np.random.rand(6, 3))
>>> df
          0         1         2
0  0.548814  0.715189  0.602763
1  0.544883  0.423655  0.645894
2  0.437587  0.891773  0.963663
3  0.383442  0.791725  0.528895
4  0.568045  0.925597  0.071036
5  0.087129  0.020218  0.832620
You can use slice notation to select every other row starting from the first row (::2) and starting from the second row (1::2). iloc is for integer indexing. You need to select the values at these locations, and add them together. The result is a numpy array that you could then convert back into a DataFrame if required.
>>> df.iloc[::2].values + df.iloc[1::2].values
array([[ 1.09369669,  1.13884417,  1.24865749],
       [ 0.82102873,  1.68349804,  1.49255768],
       [ 0.65517386,  0.94581504,  0.9036559 ]])
You use values to remove the indexing. This is what happens otherwise:
>>> df.iloc[::2] + df.iloc[1::2].values
          0         1         2
0  1.093697  1.138844  1.248657
2  0.821029  1.683498  1.492558
4  0.655174  0.945815  0.903656
>>> df.iloc[::2].values + df.iloc[1::2]
          0         1         2
1  1.093697  1.138844  1.248657
3  0.821029  1.683498  1.492558
5  0.655174  0.945815  0.903656
For a more general solution:
df = pd.DataFrame(np.random.rand(9, 3))
n = 3 # Number of consecutive rows to group.
df['group'] = [idx // n for idx in range(len(df.index))]
df.groupby('group').sum()
              0         1         2
group
0      1.531284  2.030617  2.212320
1      1.038615  1.737540  1.432551
2      1.695590  1.971413  1.902501
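Since the question actually pairs adjacent columns rather than rows, the same slicing idea works column-wise as well. A rough sketch for a frame with 10 rows and 12 columns (names assumed):
import numpy as np
import pandas as pd

initial = pd.DataFrame(np.random.rand(10, 12))
# Add every even-numbered column to the odd-numbered column that follows it.
coarse = pd.DataFrame(initial.iloc[:, ::2].values + initial.iloc[:, 1::2].values)
print(coarse.shape)   # (10, 6)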
