In R's ddply function, you can compute any new columns group-wise, and append the result to the original dataframe, such as:
ddply(mtcars, .(cyl), transform, n=length(cyl)) # n is appended to the df
In Python/pandas, I have been computing it first and then merging, like this:
df1 = mtcars.groupby("cyl").apply(lambda x: pd.Series(x["cyl"].count(), index=["n"])).reset_index()
mtcars = pd.merge(mtcars, df1, on=["cyl"])
or something like that.
However, I always feel like that's pretty cumbersome, so is it feasible to do it all in one step?
Thanks.
You can add a column to a DataFrame by assigning the result of a groupby/transform operation to it:
mtcars['n'] = mtcars.groupby("cyl")['cyl'].transform('count')
import pandas as pd
import pandas.rpy.common as com  # note: pandas.rpy has been removed from recent pandas versions;
                                 # statsmodels' sm.datasets.get_rdataset("mtcars").data is one alternative way to load mtcars
mtcars = com.load_data('mtcars')
mtcars['n'] = mtcars.groupby("cyl")['cyl'].transform('count')
print(mtcars.head())
yields
mpg cyl disp hp drat wt qsec vs am gear carb n
Mazda RX4 21.0 6 160 110 3.90 2.620 16.46 0 1 4 4 7
Mazda RX4 Wag 21.0 6 160 110 3.90 2.875 17.02 0 1 4 4 7
Datsun 710 22.8 4 108 93 3.85 2.320 18.61 1 1 4 1 11
Hornet 4 Drive 21.4 6 258 110 3.08 3.215 19.44 1 0 3 1 7
Hornet Sportabout 18.7 8 360 175 3.15 3.440 17.02 0 0 3 2 14
To add multiple columns, you could use groupby/apply. Make sure the function you apply returns a DataFrame with the same index as its input. For example,
mtcars[['n','total_wt']] = mtcars.groupby("cyl").apply(
    lambda x: pd.DataFrame({'n': len(x['cyl']), 'total_wt': x['wt'].sum()},
                           index=x.index))
print(mtcars.head())
yields
mpg cyl disp hp drat wt qsec vs am gear carb n total_wt
Mazda RX4 21.0 6 160 110 3.90 2.620 16.46 0 1 4 4 7 21.820
Mazda RX4 Wag 21.0 6 160 110 3.90 2.875 17.02 0 1 4 4 7 21.820
Datsun 710 22.8 4 108 93 3.85 2.320 18.61 1 1 4 1 11 25.143
Hornet 4 Drive 21.4 6 258 110 3.08 3.215 19.44 1 0 3 1 7 21.820
Hornet Sportabout 18.7 8 360 175 3.15 3.440 17.02 0 0 3 2 14 55.989
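Note: on newer pandas versions (1.5 and later), groupby/apply respects group_keys=True by default, which can prepend the cyl values to the result's index and break the alignment above. A possible fix on those versions (a sketch, since the exact behavior depends on the pandas version) is to pass group_keys=False:
mtcars[['n','total_wt']] = mtcars.groupby("cyl", group_keys=False).apply(
    lambda x: pd.DataFrame({'n': len(x['cyl']), 'total_wt': x['wt'].sum()},
                           index=x.index))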
Related
My data is related to cricket, a sport (like baseball). Each inning has at most 20 overs, and each over has about 6 balls.
data:
season match_id inning sum_total_runs sum_total_wickets over/ball innings_score
32 2008 60 1 61 0 5.1 0
33 2008 60 1 61 1 5.2 0
34 2008 60 1 61 1 5.3 0
35 2008 60 1 61 1 5.4 0
36 2008 60 1 61 1 5.5 0
... ... ... ... ... ... ... ...
179073 2019 11415 2 152 5 19.2 0
179074 2019 11415 2 154 5 19.3 0
179075 2019 11415 2 155 6 19.4 0
179076 2019 11415 2 157 6 19.5 0
179077 2019 11415 2 157 7 19.6 0
111972 rows × 7 columns
innings_score is a new column created by me (given a default value of 0). I want to update it.
The values that I want to enter in it are the results of the df.groupby below.
In[]:
df.groupby(['season', 'match_id', 'inning'])['sum_total_runs'].max()
Out[]:
season match_id inning
2008 60 1 222
2 82
61 1 240
2 207
62 1 129
...
2019 11413 2 170
11414 1 155
2 162
11415 1 152
2 157
Name: sum_total_runs, Length: 1276, dtype: int64
I want innings_score to be like:
season match_id inning sum_total_runs sum_total_wickets over/ball innings_score
32 2008 60 1 61 0 5.1 222
33 2008 60 1 61 1 5.2 222
34 2008 60 1 61 1 5.3 222
35 2008 60 1 61 1 5.4 222
36 2008 60 1 61 1 5.5 222
... ... ... ... ... ... ... ...
179073 2019 11415 2 152 5 19.2 157
179074 2019 11415 2 154 5 19.3 157
179075 2019 11415 2 155 6 19.4 157
179076 2019 11415 2 157 6 19.5 157
179077 2019 11415 2 157 7 19.6 157
111972 rows × 7 columns
I would use assign. Starting from a simple example:
import pandas as pd
dt = pd.DataFrame({"name1":["A", "A", "B", "B", "C", "C"], "name2":["C", "C", "C", "D", "D", "D"], "value":[1, 2, 3, 4, 5, 6]})
grouping_variables = ["name1", "name2"]
dt = dt.set_index(grouping_variables)
dt = dt.assign(new_column=dt.groupby(grouping_variables)["value"].max())
As you can see, you set your grouping_variables as the index before running the assignment.
You can always reset the index at the end if you don't want to keep the grouping_variables as the index:
dt.reset_index()
One way is to set those 3 columns as the index, assign the groupby result as a new column, and reset the index after that.
While those columns are the index, the groupby result and the dataframe share the same index, so pandas will automatically align them and insert the correct values in the correct rows. Resetting the index then turns them back into normal columns.
Something like this:
In [46]: df
Out[46]:
season match_id inning sum_total_runs sum_total_wickets over/ball
0 2008 60 1 61 0 5.1
1 2008 60 1 61 1 5.2
2 2008 60 1 61 1 5.3
3 2008 60 1 61 1 5.4
4 2008 60 1 61 1 5.5
5 2019 11415 2 152 5 19.2
6 2019 11415 2 154 5 19.3
7 2019 11415 2 155 6 19.4
8 2019 11415 2 157 6 19.5
9 2019 11415 2 157 7 19.6
In [47]: df.set_index(['season', 'match_id', 'inning']).assign(innings_score=df.groupby(['season', 'match_id', 'inning'])['sum_total_runs'].max()).reset_index()
Out[47]:
season match_id inning sum_total_runs sum_total_wickets over/ball innings_score
0 2008 60 1 61 0 5.1 61
1 2008 60 1 61 1 5.2 61
2 2008 60 1 61 1 5.3 61
3 2008 60 1 61 1 5.4 61
4 2008 60 1 61 1 5.5 61
5 2019 11415 2 152 5 19.2 157
6 2019 11415 2 154 5 19.3 157
7 2019 11415 2 155 6 19.4 157
8 2019 11415 2 157 6 19.5 157
9 2019 11415 2 157 7 19.6 157
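As a side note, the same innings_score can usually be obtained without touching the index at all, by broadcasting the group maximum back onto every row with transform, as in the first answer above. A minimal sketch:
df['innings_score'] = df.groupby(['season', 'match_id', 'inning'])['sum_total_runs'].transform('max')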
My dataset:
df
Month 1 2 3 4 5 Label
Name
A 120 80.5 120 105.5 140 0
B 80 110 98.5 105 100 1
C 150 90.5 105 120 190 2
D 100 105 98.5 110 120 1
...
To draw a plot by month, I transpose the dataframe:
df = df.T
df
Name A B C D
Month
1 120 80 150 100
2 80.05 110 90.5 105
3 130 98.5 105 98.005
4 105.5 105 120 110
5 140 100 190 120
Label 0.00 1.0 2.0 1.000
Ultimately, what I want to do is draw a plot with the month on the x-axis and the value on the y-axis, but I have two questions.
Q1.
When transposing, the data type of 'Label' is changed (int -> float).
Can just the 'Label' row be kept as int type?
The output I want:
df = df.T
df
Name A B C D
Month
1 120 80 150 100
2 80.05 110 90.5 105
3 130 98.5 105 98.005
4 105.5 105 120 110
5 140 100 190 120
Label 0 1 2 1
Q2.
Q1 is actually in service of Q2.
When drawing the plot, I want to group it using the label (like seaborn's hue).
When drawing a plot from the transposed table above, is there a way to make that grouping possible?
(matplotlib or sns, the method does not matter.)
The label above doesn't have to be int, and if Q2 can be solved directly, you don't need to answer Q1.
Thank you for reading.
Q2: You need to reshape the values, e.g. with DataFrame.melt here, so that hue can be used:
import seaborn as sns
df1 = df.reset_index().melt(['Name','Label'])
print (df1)
sns.stripplot(data=df1,hue='Label',x='Name',y='value')
Q1: Pandas does not support that, because each column holds a single dtype; e.g. even if you convert the last row to int, the values stay floats:
df = df.T
df.loc['Label', :] = df.loc['Label', :].astype(int)
print (df)
Name A B C D
1 120.0 80.0 150.0 100.0
2 80.5 110.0 90.5 105.0
3 120.0 98.5 105.0 98.5
4 105.5 105.0 120.0 110.0
5 140.0 100.0 190.0 120.0
Label 0.0 1.0 2.0 1.0
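The underlying reason is that after the transpose every column (A, B, C, D) has dtype float64, and a single row cannot have its own dtype. A quick check, just as an illustrative sketch:
print (df.dtypes)   # each of A, B, C, D shows float64, so the 'Label' row cannot stay int on its own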
EDIT:
df1 = df.reset_index().melt(['Name','Label'], var_name='Month')
print (df1)
Name Label Month value
0 A 0 1 120.0
1 B 1 1 80.0
2 C 2 1 150.0
3 D 1 1 100.0
4 A 0 2 80.5
5 B 1 2 110.0
6 C 2 2 90.5
7 D 1 2 105.0
8 A 0 3 120.0
9 B 1 3 98.5
10 C 2 3 105.0
11 D 1 3 98.5
12 A 0 4 105.5
13 B 1 4 105.0
14 C 2 4 120.0
15 D 1 4 110.0
16 A 0 5 140.0
17 B 1 5 100.0
18 C 2 5 190.0
19 D 1 5 120.0
sns.lineplot(data=df1,hue='Label',x='Month',y='value')
I have a dataframe that is missing some months.
index month value
0 201501 100
1 201507 172
2 201602 181
3 201605 98
I want to fill in the missing months of the above dataframe using the list below.
list = [201501, 201502, 201503 ... 201612]
The result I want to get...
index month value
0 201501 100
1 201502 100
2 201503 100
3 201504 100
4 201505 100
5 201506 100
6 201507 172
7 201508 172
...
...
23 201611 98
34 201612 98
Setup
my_list = list(range(201501,201509))
df = df.drop('index', axis=1)  # drop the 'index' column left over from pd.read_clipboard
print(df)
month value
0 201501 100
1 201507 172
2 201602 181
3 201605 98
pd.DataFrame.reindex: set month as the index, reindex on the sorted union of the existing months and my_list, then forward-fill:
import numpy as np
df = (df.set_index('month')
        .reindex(index=np.sort(np.unique(df['month'].tolist() + my_list)))
        .ffill()
        .reset_index())
print(df)
month value
0 201501 100.0
1 201502 100.0
2 201503 100.0
3 201504 100.0
4 201505 100.0
5 201506 100.0
6 201507 172.0
7 201508 172.0
8 201602 181.0
9 201605 98.0
10 201612 98.0
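Reindexing introduces NaN rows before the forward fill, which is why value comes back as float. If the integer values from the question are wanted, an optional final cast restores them:
df['value'] = df['value'].astype(int)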
Using pandas.DataFrame.merge:
l = list(range(201501,201509))
new_df = df.merge(pd.Series(l,name='month'),how='outer').sort_values('month').ffill()
new_df['index'] = range(new_df.shape[0])
Output:
index month value
0 0 201501 100.0
4 1 201502 100.0
5 2 201503 100.0
6 3 201504 100.0
7 4 201505 100.0
8 5 201506 100.0
1 6 201507 172.0
9 7 201508 172.0
2 8 201602 181.0
3 9 201605 98.0
I have the dataframe below and I would like to return the average values of 'Age' and 'Sales' for each Flavor ('Chocolate' or 'Vanilla'), so that the average Age of 'Vanilla' is x, the average Age of 'Chocolate' is y, etc.
I haven't been able to find the answer anywhere on the web and I'm stuck.
print(MergeData.head())
Customer Type Flavor Age Sales Store Goals Goal FlavorCode \
0 1 Adult Chocolate 45 4.25 Greeley 25 25 C
1 2 Child Vanilla 5 2.90 Greeley 25 25 V
2 6 Teenager Chocolate 16 4.10 Greeley 25 25 C
3 8 Child Vanilla 4 3.00 Greeley 25 25 V
4 10 Child Vanilla 6 2.50 Greeley 25 25 V
AgeBin1 AgeBin2
0 (28.0, 72.0] B
1 (3.999, 14.0] A
2 (14.0, 28.0] A
3 (3.999, 14.0] A
4 (3.999, 14.0] A
IIUC:
df.groupby(['Flavor'])[['Age','Sales']].transform('mean')
Demo:
print(df.groupby(['Flavor'])[['Age','Sales']].transform('mean'))
Output:
Age Sales
0 30.5 4.175
1 5.0 2.800
2 30.5 4.175
3 5.0 2.800
4 5.0 2.800
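If a single summary row per flavor is wanted instead of broadcasting the means back onto every row, a plain aggregation works too; a minimal sketch, using the same df as the demo above:
df.groupby('Flavor')[['Age', 'Sales']].mean()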
You can even use df.loc.
Just using an example dataset here:
>>> df
Name Score1 Score2
0 Alisa 62.2 89
1 Bobby 47.4 87
2 Cathrine 55.5 67
3 Madonna 74.6 55
4 Rocky 31.2 47
5 Sebastian 77.5 72
6 Jaqluine 85.6 76
7 Rahul 63.5 79
8 David 42.8 44
9 Andrew 32.3 92
10 Ajay 71.2 99
11 Teresa 57.4 69
Mean of the DataFrame:
>>> df.mean()
Score1 58.433333
Score2 73.000000
dtype: float64
For a particular range of columns:
>>> df.loc[:,"Score1":"Score2"].mean()
Score1 58.433333
Score2 73.000000
dtype: float64
I have two dicts, one with three columns (A) and another with six columns (B). I would like to use the value in the first column (an index, constant 1-4 for both) together with the value in the second column (1-2000) to specify the correct element in the third column for the subtraction. The second dict is similar in that the first and second columns are used to find the correct row, but it is the value in the sixth column of that row that is needed for the subtraction.
A B
1 1 260 541 1 1 260 280 0.001 521.4
1 1 390 1195 1 1 390 900 0.02 963.3
1 1 102 6 1 1 102 2 0.01 4.8
2 1 65 12 2 1 65 9 0.13 13.1
2 1 515 659 2 1 515 356 0.002 532.2
2 1 354 1200 2 1 354 1087 0.119 1502.3
3 1 1190 53 3 1 1190 46 0.058 12.0
3 1 1985 3 3 1 1985 1 0.006 1.02
3 1 457 192 3 1 25 3 0.001 178.2
4 1 261 2084 4 1 261 1792 0.196 100.7
4 1 12 0 4 1 12 0 0.000 12.6
4 1 1756 30 4 1 1756 28 0.006 23.7
4 1 592 354 4 1 592 291 0.357 251.9
So basically I would like to subtract the last column of B from the last column of A whilst retaining the information held in the first and second columns.
C (desired output)
1 1 260 19.6
1 1 390 231.7
1 1 102 1.2
2 1 65 -1.1
2 1 515 126.8
2 1 354 -302.3
3 1 1190 41.0
3 1 1985 1.98
3 1 457 13.8
4 1 261 1983.3
4 1 12 -12.6
4 1 1756 6.3
4 1 592 102.1
I have been through SO for hours looking for a solution but haven't found one yet, though I'm sure it must be possible.
I also need to be able to create a scatter graph afterwards, in case anyone has suggestions on how to plot the positive values and ignore the negatives.
EDIT:
I have added my code below to make things clearer. I take in a three-column csv file and then need to count the frequency of each value of the third column among rows that share the same value in the first column. B then has further alterations to get out the desired data streams, and then the subtraction needs to be made. A few of the comments mentioned that columns one and two are unnecessary, but the value in column three is linked to the value in column one and so they must always remain in the same row together.
import pandas as pd
import numpy as np

def ba(fn, float1, float2):
    # read the three-column csv and name the columns so they can be referenced below
    df = pd.read_csv(fn, header=None, skipfooter=6, engine='python',
                     names=['col1', 'col2', 'col3'])
    # frequency of each (col1, col3) pair, broadcast back onto every row
    df['col4'] = df.groupby(['col1', 'col3'])['col2'].transform(np.size)
    df['col5'] = df['col4'] / float(float2)
    df['col6'] = df['col5'] * float1
    df = df.set_index('col1')
    return dict(tuple(df.groupby('col1')))  # one dataframe per value of col1
IIUC, if A and B are DataFrames, then:
In [1062]: A.iloc[:, :3].assign(output=A.iloc[:, -1] - B.iloc[:, -1])
Out[1062]:
0 1 2 output
0 1 1 260 19.60
1 1 1 390 231.70
2 1 1 102 1.20
3 2 1 65 -1.10
4 2 1 515 126.80
5 2 1 354 -302.30
6 3 1 1190 41.00
7 3 1 1985 1.98
8 3 1 457 13.80
9 4 1 261 1983.30
10 4 1 12 -12.60
11 4 1 1756 6.30
12 4 1 592 102.10
Details
In [1063]: A
Out[1063]:
0 1 2 3
0 1 1 260 541
1 1 1 390 1195
2 1 1 102 6
3 2 1 65 12
4 2 1 515 659
5 2 1 354 1200
6 3 1 1190 53
7 3 1 1985 3
8 3 1 457 192
9 4 1 261 2084
10 4 1 12 0
11 4 1 1756 30
12 4 1 592 354
In [1064]: B
Out[1064]:
0 1 2 3 4 5
0 1 1 260 280 0.001 521.40
1 1 1 390 900 0.020 963.30
2 1 1 102 2 0.010 4.80
3 2 1 65 9 0.130 13.10
4 2 1 515 356 0.002 532.20
5 2 1 354 1087 0.119 1502.30
6 3 1 1190 46 0.058 12.00
7 3 1 1985 1 0.006 1.02
8 3 1 25 3 0.001 178.20
9 4 1 261 1792 0.196 100.70
10 4 1 12 0 0.000 12.60
11 4 1 1756 28 0.006 23.70
12 4 1 592 291 0.357 251.90
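The positional subtraction above assumes A and B list the same key values in the same row order. If that ordering is not guaranteed (note row 8, where B's third column is 25 while A's is 457), a key-based merge is safer. A sketch under that assumption, with the default integer column labels shown above and made-up names a_last/b_last for the value columns:
# keep the key columns plus each frame's last column, with explicit names
a = A.rename(columns={3: 'a_last'})
b = B[[0, 1, 2, 5]].rename(columns={5: 'b_last'})
merged = a.merge(b, on=[0, 1, 2], how='inner')   # rows whose keys do not match in both frames are dropped
merged['output'] = merged['a_last'] - merged['b_last']
C = merged[[0, 1, 2, 'output']]
# scatter plot of the positive differences only, ignoring the negatives
C[C['output'] > 0].plot.scatter(x=2, y='output')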