groupby and sum two columns and set as one column in pandas - python

I have the following data frame:
import pandas as pd
data = pd.DataFrame()
data['Home'] = ['A','B','C','D','E','F']
data['HomePoint'] = [3,0,1,1,3,3]
data['Away'] = ['B','C','A','E','D','D']
data['AwayPoint'] = [0,3,1,1,0,0]
i want to groupby the columns ['Home', 'Away'] and change the name as Team. Then i like to sum homepoint and awaypoint as name as Points.
Team Points
A 4
B 0
C 4
D 1
E 4
F 3
How can I do it?
I was trying different approach using the following post:
Link
But I was not able to get the format that I wanted.
Greatly appreciate your advice.
Thanks
Zep.

A simple way is to create two new Series indexed by the teams:
home = pd.Series(data.HomePoint.values, data.Home)
away = pd.Series(data.AwayPoint.values, data.Away)
Then, the result you want is:
home.add(away, fill_value=0).astype(int)
Note that home + away does not work, because team F never played away, so would result in NaN for them. So we use Series.add() with fill_value=0.
A complicated way is to use DataFrame.melt():
goo = data.melt(['HomePoint', 'AwayPoint'], var_name='At', value_name='Team')
goo.HomePoint.where(goo.At == 'Home', goo.AwayPoint).groupby(goo.Team).sum()
Or from the other perspective:
ooze = data.melt(['Home', 'Away'])
ooze.value.groupby(ooze.Home.where(ooze.variable == 'HomePoint', ooze.Away)).sum()

You can concatenate, pairwise, columns of your input dataframe. Then use groupby.sum.
# calculate number of pairs
n = int(len(df.columns)/2)+1)
# create list of pairwise dataframes
df_lst = [data.iloc[:, 2*i:2*(i+1)].set_axis(['Team', 'Points'], axis=1, inplace=False) \
for i in range(n)]
# concatenate list of dataframes
df = pd.concat(df_lst, axis=0)
# perform groupby
res = df.groupby('Team', as_index=False)['Points'].sum()
print(res)
Team Points
0 A 4
1 B 0
2 C 4
3 D 1
4 E 4
5 F 3

Related

Cannot set a DataFrame with multiple columns to the single column total_servings

I am a beginner and getting familiar with pandas .
It is throwing an error , When I was trying to create a new column this way :
drinks['total_servings'] = drinks.loc[: ,'beer_servings':'wine_servings'].apply(calculate,axis=1)
Below is my code, and I get the following error for line number 9:
"Cannot set a DataFrame with multiple columns to the single column total_servings"
Any help or suggestion would be appreciated :)
import pandas as pd
drinks = pd.read_csv('drinks.csv')
def calculate(drinks):
return drinks['beer_servings']+drinks['spirit_servings']+drinks['wine_servings']
print(drinks)
drinks['total_servings'] = drinks.loc[:, 'beer_servings':'wine_servings'].apply(calculate,axis=1)
drinks['beer_sales'] = drinks['beer_servings'].apply(lambda x: x*2)
drinks['spirit_sales'] = drinks['spirit_servings'].apply(lambda x: x*4)
drinks['wine_sales'] = drinks['wine_servings'].apply(lambda x: x*6)
drinks
In your code, when functioncalculate is called with axis=1, it passes each row of the Dataframe as an argument. Here, the function calculate is returning dataframe with multiple columns but you are trying to assigned to a single column, which is not possible. You can try updating your code to this,
def calculate(each_row):
return each_row['beer_servings'] + each_row['spirit_servings'] + each_row['wine_servings']
drinks['total_servings'] = drinks.apply(calculate, axis=1)
drinks['beer_sales'] = drinks['beer_servings'].apply(lambda x: x*2)
drinks['spirit_sales'] = drinks['spirit_servings'].apply(lambda x: x*4)
drinks['wine_sales'] = drinks['wine_servings'].apply(lambda x: x*6)
print(drinks)
I suppose the reason is the wrong argument name inside calculate method. The given argument is drink but drinks used to calculate sum of columns.
The reason is drink is Series object that represents Row and sum of its elements is scalar. Meanwhile drinks is a DataFrame and sum of its columns will be a Series object
Sample code shows that this method works.
import pandas as pd
df = pd.DataFrame({
"A":[1,1,1,1,1],
"B":[2,2,2,2,2],
"C":[3,3,3,3,3]
})
def calculate(to_calc_df):
return to_calc_df["A"] + to_calc_df["B"] + to_calc_df["C"]
df["total"] = df.loc[:, "A":"C"].apply(calculate, axis=1)
print(df)
Result
A B C total
0 1 2 3 6
1 1 2 3 6
2 1 2 3 6
3 1 2 3 6
4 1 2 3 6

Extract values within the quotes signs into two separate columns with python

How can i extract the values within the quotes signs into two separate columns with python. The dataframe is given below:
df = pd.DataFrame(["'FRH02';'29290'", "'FRH01';'29300'", "'FRT02';'29310'", "'FRH03';'29340'",
"'FRH05';'29350'", "'FRG02';'29360'"], columns = ['postcode'])
df
postcode
0 'FRH02';'29290'
1 'FRH01';'29300'
2 'FRT02';'29310'
3 'FRH03';'29340'
4 'FRH05';'29350'
5 'FRG02';'29360'
i would like to get an output like the one below:
postcode1 postcode2
FRH02 29290
FRH01 29300
FRT02 29310
FRH03 29340
FRH05 29350
FRG02 29360
i have tried several str.extract codes but havent been able to figure this out. Thanks in advance.
Finishing Quang Hoang's solution that he left in the comments:
import pandas as pd
df = pd.DataFrame(["'FRH02';'29290'",
"'FRH01';'29300'",
"'FRT02';'29310'",
"'FRH03';'29340'",
"'FRH05';'29350'",
"'FRG02';'29360'"],
columns = ['postcode'])
# Remove the quotes and split the strings, which results in a Series made up of 2-element lists
postcodes = df['postcode'].str.replace("'", "").str.split(';')
# Unpack the transposed postcodes into 2 new columns
df['postcode1'], df['postcode2'] = zip(*postcodes)
# Delete the original column
del df['postcode']
print(df)
Output:
postcode1 postcode2
0 FRH02 29290
1 FRH01 29300
2 FRT02 29310
3 FRH03 29340
4 FRH05 29350
5 FRG02 29360
You can use Series.str.split:
p1 = []
p2 = []
for row in df['postcode'].str.split(';'):
p1.append(row[0])
p2.append(row[1])
df2 = pd.DataFrame()
df2["postcode1"] = p1
df2["postcode2"] = p2

How to extract values from a list in Python and put into a dataframe

I have trained a model and have asked the model to produce the coefficients:
modelcoeffs = model.fit(X_train, y_train).coef_
coeffslist = list(modelcoeffs)
which yiels me for example:
print(coeffslist):
[0.17005542 0.72965947 0.6833308 0.02509676]
I am trying to split these 4 coefficients out however they dont seem to be individual elements?
does anyone know how to split these into four numbers?
I am trying to get:
df['1'] = coeffslist[0]
df['2'] = coeffslist[1]
df['3'] = coeffslist[2]
df['4'] = coeffslist[3]
But it gives me NaN in the df. Does anyone have any ideas? thanks!
UPDATE
I am basically trying to get the coeffs to append to a df
print(df)
1 2 3 4
.... ..... ..... .....
0.17005542 0.72965947 0.6833308 0.02509676
This coeffslist doesn't look like a valid Python structure, it's missing commas.
But you might try this:
import pandas as pd
df = pd.DataFrame([0.17005542, 0.72965947, 0.6833308, 0.02509676])
print(df)
Output:
0
0 0.170055
1 0.729659
2 0.683331
3 0.025097
To get the coefs as row try this:
import pandas as pd
df = pd.DataFrame(columns=list("1234"))
df.loc[len(df)] = [0.17005542, 0.72965947, 0.6833308, 0.02509676]
print(df)
Output:
1 2 3 4
0 0.170055 0.729659 0.683331 0.025097
And if you want to add another row (append) of coefs, just do this:
df.loc[1] = [0.17005542, 0.72965947, 0.6833308, 0.02509676]
print(df)
Output:
1 2 3 4
0 0.170055 0.729659 0.683331 0.025097
1 0.170055 0.729659 0.683331 0.025097
you can convert [0.17005542 0.72965947 0.6833308 0.02509676] to a sting, split it on space, convert to float again and then append to a dataframe.
str_list= str(coeffslist[0])
float_list= [float(x) for x in str_list.split()]
df=pd.DataFrame(columns=['1','2','3','4'])
a_series = pd.Series(float_list, index = df.columns)
df = df.append(a_series, ignore_index=True)

I want to add the values of two cells present in the same column based on their " index = somevalue"

I have a data frame with the column "Key" as index like below:
Key
Prediction
C11D0 0
C11D1 8
C12D0 1
C12D1 5
C13D0 3
C13D1 9
C14D0 4
C14D1 9
C15D0 5
C15D1 3
C1D0 5
C2D0 7
C3D0 4
C4D0 1
C4D1 9
I want to add the values of two cells in Prediction column when their "index = something". The logic is I want to add the values whose index matches for upto 4 letters. Example: indexes having "C11D0 & C11D1" or having "C14D0 & C14D1" ? Then the output will be:
Operation
Addition Result
C11D0+C11D1 8
C12D0+C12D1 6
C13D0+C13D1 12
you can use isin function.
Example:
import pandas as pd
df = pd.DataFrame({'id':[1,2,3,4,5,6], 'value':[1,2,1,3,7,1]})
df[df.id.isin([1,5,6])].value.sum()
output:
9
for your case
idx = ['C11D0', 'C11D1']
print(df[df.Key.isin(idx)].Prediction.sum()) #outputs 8
First set key as a column if it is the index:
df.reset_index(inplace=True)
Then you can use DataFrame.loc with boolean indexing:
df.loc[df['key'].isin(["C11D0","C11D1"]),'Prediciton'].sum()
You can also create a function for it:
def sum_select_df(key_list,df):
return pd.concat([df[df['Key'].isin(['C'+str(key)+'D1','C'+str(key)+'D0'])] for key in key_list])['Prediction'].sum()
sum_select_df([11,14],df)
Output:
21
Here is a complete solution, slightly different from the other answers so far. I tried to make it pretty self-explanatory, but let me know if you have any questions!
import numpy as np # only used to generate test data
import pandas as pd
import itertools as itt
start_inds = ["C11D0", "C11D1", "C12D0", "C12D1", "C13D0", "C13D1", "C14D0", "C14D1",
"C15D0", "C15D1", "C1D0", "C2D0", "C3D0", "C4D0", "C4D1"]
test_vals = np.random.randint(low=0, high=10, size=len(start_inds))
df = pd.DataFrame(data=test_vals, index=start_inds, columns=["prediction"])
ind_combs = itt.combinations(df.index.array, 2)
sum_records = ((f"{ind1}+{ind2}", df.loc[[ind1, ind2], "prediction"].sum())
for (ind1, ind2) in ind_combs if ind1[:4] == ind2[:4])
res_ind, res_vals = zip(*sum_records)
res_df = pd.DataFrame(data=res_vals, index=res_ind, columns=["sum_result"])

pandas dataframe contains list

I have currently run the following script which uses Fuzzylogic to replace some common words from the list. Dataframe df1 contains my default list of possible values. Dataframe df2 is the main dataframe where transformations/changes are undertaken after referring to Dataframe df1. The code is as follows:
df1 = pd.DataFrame(['one','two','three','four','five','tsst'])
df2 = pd.DataFrame({'not_shifted':[np.nan,'one','too','three','fours','five','six',np.nan,'test']})
# Drop nan value
df2=pd.DataFrame(df2['not_shifted'].fillna(value=''))
df2['not_shifted'] = df2['not_shifted'].map(lambda x: difflib.get_close_matches(x, df1[0]))
The problem is the output is a dataframe which contains square brackets. To make matters worse, none of the texts within df2['not_shifted'] are viewable/ recallable:
Out[421]:
not_shifted
0 []
1 [one]
2 [two]
3 [three]
4 [four]
5 [five]
6 []
7 []
8 [tsst]
Please help.
df2.not_shifted.apply(lambda x: x[0] if len(x) != 0 else "") or simply df2.not_shifted.str[0] as solved by #Psidom
def replace_all(eg):
rep = {"[":"",
"]":"",
"u":"",
"}":"",
"'":"",
'"':"",
"frozenset":""}
for i,j in rep.items():
eg = eg.replace(i,j)
return eg
for each in df.columns:
df[each] = df[each].apply(lambda x : replace_all(str(x)))

Categories

Resources