I am trying to group a set of records and perform cuts within each group dynamically, based on the group's min, max, and mid-value (the average of the min and max).
My dataset looks something like this:
Country Value
Uganda 210
Kenya 423
Kenya 315
Tanzania 780
Uganda 124
Uganda 213
Tanzania 978
Kenya 524
What I expect is the range each value falls into, above or below the mid-value:
Country Value Range
Uganda 210 (168.5, 213)
Uganda 124 (124, 168.5)
Uganda 213 (168.5, 213)
Kenya 423 (419.5, 524)
Kenya 315 (315, 419.5)
Kenya 524 (419.5, 524)
Tanzania 780 (780, 879)
Tanzania 978 (879, 978)
I can achieve this with a loop iterating over each group, and I can also achieve the cuts based on the min and max over the entire dataset, but not over the individual groups. I was wondering if it can be done in a line or two with pandas, without loops.
This is how I did it:
df['Range'] = df.groupby('Country')['Value'].transform(lambda x: pd.cut(x, bins=2).astype(str))
Try this:
data['Range'] = data.groupby('Country').Value.apply(pd.cut, bins=2)
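For reference, a minimal self-contained sketch combining the sample data above with the transform approach (column names as in the question):
import pandas as pd

df = pd.DataFrame({
    'Country': ['Uganda', 'Kenya', 'Kenya', 'Tanzania', 'Uganda',
                'Uganda', 'Tanzania', 'Kenya'],
    'Value': [210, 423, 315, 780, 124, 213, 978, 524],
})

# Cut each country's values into two bins spanning that group's min..max.
df['Range'] = (df.groupby('Country')['Value']
                 .transform(lambda s: pd.cut(s, bins=2).astype(str)))
print(df)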
I'm starting with a dataframe of baseball seasons, a section of which looks similar to this:
Name Season AB H SB playerid
13047 A.J. Pierzynski 2013 503 137 1 746
6891 A.J. Pierzynski 2006 509 150 1 746
1374 Rod Carew 1977 616 239 23 1001942
1422 Stan Musial 1948 611 230 7 1009405
1507 Todd Helton 2000 580 216 5 432
1508 Nomar Garciaparra 2000 529 197 5 190
1509 Ichiro Suzuki 2004 704 262 36 1101
From these seasons, I want to create a dataframe of career stats; that is, one row for each player which is a sum of their AB, H, etc. This dataframe should still include the names of the players. The playerid in the above is a unique key for each player and should either be an index or an unchanged value in a column after creating the career stats dataframe.
My hypothetical starting point is df_careers = df_seasons.groupby('playerid').agg(sum), but this leaves out all the non-numeric data. With numeric_only=False, the name column turns into a concatenated mess like 'Ichiro SuzukiIchiro SuzukiIchiro Suzuki', which then needs a bunch of cleaning. I'd like to be able to do this with other data sets too, and my actual data has more like 25 columns, so I'd rather learn a general routine for getting the Name data back (or preserving it from the outset) than write a dataset-specific function to use with groupby('playerid').agg(func).
I'm guessing there's a fairly simple way to do this, but I only started learning Pandas a week ago, so there are gaps in my knowledge.
You can write your own condition for how you want to handle the non-summed columns.
col = df.columns.tolist()
col.remove('playerid')
# Keep the first value of object (string) columns; sum everything else.
df.groupby('playerid').agg({i: (lambda x: x.iloc[0] if x.dtype == 'object' else x.sum()) for i in col})
Output:
                       Name  Season    AB    H  SB
playerid
190       Nomar_Garciaparra    2000   529  197   5
432             Todd_Helton    2000   580  216   5
746         A.J._Pierzynski    4019  1012  287   2
1101          Ichiro_Suzuki    2004   704  262  36
1001942           Rod_Carew    1977   616  239  23
1009405         Stan_Musial    1948   611  230   7
Note that other numeric columns, such as Season, are still summed (A.J. Pierzynski's two seasons become 4019); only object-dtype columns keep their first value.
If there is a one-to-one relationship between 'playerid' and 'Name', as appears to be the case, you can just include 'Name' in the groupby columns:
stat_cols = ['AB', 'H', 'SB']
groupby_cols = ['playerid', 'Name']
results = df.groupby(groupby_cols)[stat_cols].sum()
Results:
AB H SB
playerid Name
190 Nomar Garciaparra 529 197 5
432 Todd Helton 580 216 5
746 A.J. Pierzynski 1012 287 2
1101 Ichiro Suzuki 704 262 36
1001942 Rod Carew 616 239 23
1009405 Stan Musial 611 230 7
If you'd prefer to group only by 'playerid' and add the 'Name' data back in afterwards, you can instead create a 'playerid' to 'Name' mapping as a dictionary, and look it up using map:
results = df.groupby('playerid')[stat_cols].sum()
name_map = pd.Series(df.Name.to_numpy(), df.playerid).to_dict()
results['Name'] = results.index.map(name_map)
Results:
AB H SB Name
playerid
190 529 197 5 Nomar Garciaparra
432 580 216 5 Todd Helton
746 1012 287 2 A.J. Pierzynski
1101 704 262 36 Ichiro Suzuki
1001942 616 239 23 Rod Carew
1009405 611 230 7 Stan Musial
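An equivalent way to build the mapping, assuming each playerid has exactly one Name as it does here, is to drop duplicates first; a small sketch:
# One row per playerid; index by id and keep the Name column as the mapping.
name_map = df.drop_duplicates('playerid').set_index('playerid')['Name']
results['Name'] = results.index.map(name_map)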
groupby.agg() can accept a dictionary that maps column names to functions. So, one solution is to pass a dictionary to agg, specifying which functions to apply to each column.
Using the sample data above, one might use:
mapping = {'AB': sum, 'H': sum, 'SB': sum, 'Season': max, 'Name': max}
df_1 = df.groupby('playerid').agg(mapping)
The choice to use 'max' for those that shouldn't be summed is arbitrary. You could define a lambda function to apply to a column if you want to handle it in a certain way. DataFrameGroupBy.agg can work with any function that will work with DataFrame.apply.
To expand this to larger data sets, you might use a dictionary comprehension. This would work well:
# Map every column to sum, except the groupby key itself.
dictionary = {x: sum for x in df.columns if x != 'playerid'}
dont_sum = {'Name': max, 'Season': max}
dictionary.update(dont_sum)
df_1 = df.groupby('playerid').agg(dictionary)
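A variant of the same idea uses pandas' string aliases, with 'first' for the columns that shouldn't be summed, which reads a bit more directly than max; a sketch under the same assumptions:
# 'first' keeps each group's first value instead of aggregating it.
agg_spec = {x: 'sum' for x in df.columns if x != 'playerid'}
agg_spec.update({'Name': 'first', 'Season': 'first'})
df_1 = df.groupby('playerid').agg(agg_spec)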
In the titanic dataset, I wish to calculate the percentage of passengers who survived within each passenger class (Pclass) 1, 2 & 3. I figured out how to get the count of passengers and the number of passengers who survived using groupby, as below:
train[['PassengerId', 'Pclass', 'Survived']]\
    .groupby('Pclass')\
    .agg(PassengerCount=pd.NamedAgg(column='PassengerId', aggfunc='count'),
         SurvivedPassengerCount=pd.NamedAgg(column='Survived', aggfunc='sum'))
So, I get the below output:
PassengerCount SurvivedPassengerCount
Pclass
1 216 136
2 184 87
3 491 119
But how do I get a percentage column? I want the output as below:
PassengerCount SurvivedPassengerCount PercSurvived
Pclass
1 216 136 62.9%
2 184 87 47.3%
3 491 119 24.2%
Thanks in advance!
Since you only need to divide SurvivedPassengerCount by PassengerCount, you can do this using the .assign method:
result = train[['PassengerId', 'Pclass', 'Survived']]\
    .groupby('Pclass')\
    .agg(PassengerCount=pd.NamedAgg(column='PassengerId', aggfunc='count'),
         SurvivedPassengerCount=pd.NamedAgg(column='Survived', aggfunc='sum'))
result = result.assign(PercSurvived=result['SurvivedPassengerCount']
                                    / result['PassengerCount'] * 100)
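To match the desired display with a percent sign, the numeric column can be formatted afterwards; note this turns the column into strings, so do it last:
# Render e.g. 62.96 as '63.0%'.
result['PercSurvived'] = result['PercSurvived'].map('{:.1f}%'.format)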
I have a data frame like below:
i_id q_id
month category_bucket
Aug Algebra Tutoring 187 64
Balloon Artistry 459 401
Carpet Installation or Replacement 427 243
Dance Lessons 181 46
Landscaping 166 60
Others 9344 4987
Tennis Instruction 161 61
Tree and Shrub Service 383 269
Wedding Photography 161 49
Window Repair 140 80
Wiring 439 206
July Algebra Tutoring 555 222
Balloon Artistry 229 202
Carpet Installation or Replacement 140 106
Dance Lessons 354 115
Landscaping 511 243
Others 9019 4470
Tennis Instruction 613 324
Tree and Shrub Service 130 100
Wedding Photography 425 191
Window Repair 444 282
Wiring 154 98
It's a multi-index data frame with month and category_bucket as the index, and i_id and q_id as the columns.
I got this by doing a groupby operation on a normal data frame like below
invites_combined.groupby(['month', 'category_bucket'])[["i_id","q_id"]].count()
I basically want a data frame with a column for category_bucket and four value columns: two each for i_id and q_id, one per month. In other words, I want to convert the multi-index data frame above to a single index so that I can access the values.
Currently it's difficult for me to access the i_id and q_id values for a particular month and category value.
If you feel there is an easier way to access the i_id and q_id values for each category and month without converting to a single index, that is fine too. A single index would be easier to loop over for each combination of month and category, though.
It seems you need reset_index to convert the MultiIndex to columns:
df = df.reset_index()
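If the goal is the four-column layout described in the question (two columns each for i_id and q_id, one per month, plus category_bucket), unstacking the month level is another option; a sketch, assuming the multi-index frame above is named df:
# Move the 'month' index level into the columns, one column per month per metric.
wide = df.unstack('month')
# Flatten the (metric, month) column MultiIndex into plain column names.
wide.columns = ['%s_%s' % (metric, month) for metric, month in wide.columns]
wide = wide.reset_index()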
I would like to shift a column in a Pandas DataFrame, but I haven't been able to find a method to do it without losing values.
(This post is quite similar to How to shift a column in Pandas DataFrame, but the validated answer doesn't give the desired output and I can't comment on it.)
Does anyone know how to do it?
## x1 x2
##0 206 214
##1 226 234
##2 245 253
##3 265 272
##4 283 291
Desired output:
## x1 x2
##0 206 nan
##1 226 214
##2 245 234
##3 265 253
##4 283 272
##5 nan 291
Use loc to add a new blank row to the DataFrame, then perform the shift.
# Append a new all-NaN row at the next integer index.
df.loc[max(df.index) + 1, :] = None
# Shift x2 down by one; the new last row receives the final value.
df.x2 = df.x2.shift(1)
The code above assumes that your index is integer based, which is the pandas default. If you're using a non-integer based index, replace max(df.index)+1 with whatever you want the new last index to be.
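An equivalent alternative with the default RangeIndex, starting from the original frame, is to reindex to one extra row and then shift; a sketch:
# reindex adds a trailing all-NaN row; shifting x2 moves its values down into it.
df2 = df.reindex(range(len(df) + 1))
df2['x2'] = df2['x2'].shift(1)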
My Problem I'm Trying To Solve
I have 11 months' worth of performance data:
Month Branded Non-Branded Shopping Grand Total
0 2/1/2015 1330 334 161 1825
1 3/1/2015 1344 293 197 1834
2 4/1/2015 899 181 190 1270
3 5/1/2015 939 208 154 1301
4 6/1/2015 1119 238 179 1536
5 7/1/2015 859 238 170 1267
6 8/1/2015 996 340 183 1519
7 9/1/2015 1138 381 172 1691
8 10/1/2015 1093 395 176 1664
9 11/1/2015 1491 426 199 2116
10 12/1/2015 1539 530 156 2225
Let's say it's February 1, 2016 and I'm asking "are the results in January statistically different from the past 11 months?"
Month Branded Non-Branded Shopping Grand Total
11 1/1/2016 1064 408 106 1578
I came across a blog...
I came across iaingallagher's blog. I will reproduce it here (in case the blog goes down).
1-sample t-test
The 1-sample t-test is used when we want to compare a sample mean to a population mean (which we already know). The average British man is 175.3 cm tall. A survey recorded the heights of 10 UK men and we want to know whether the mean of the sample is different from the population mean.
# 1-sample t-test
from scipy import stats
one_sample_data = [177.3, 182.7, 169.6, 176.3, 180.3, 179.4, 178.5, 177.2, 181.8, 176.5]
one_sample = stats.ttest_1samp(one_sample_data, 175.3)
print "The t-statistic is %.3f and the p-value is %.3f." % one_sample
Result:
The t-statistic is 2.296 and the p-value is 0.047.
Finally, to my question...
In iaingallagher's example, he knows the population mean and is comparing a sample (one_sample_data). In MY example, I want to see if 1/1/2016 is statistically different from the previous 11 months. So in my case, the previous 11 months are an array (instead of a single population-mean value) and my sample is one data point (instead of an array)... so it's kind of backwards.
QUESTION
If I was focused on the Shopping column data:
Will scipy.stats.ttest_1samp([161, 197, 190, 154, 179, 170, 183, 172, 176, 199, 156], 106) produce a valid result even though my sample (the first parameter) is a list of previous results, and the popmean I'm comparing against is not a population mean but a single sample?
If this is not the correct stats function, any recommendation on what to use for this hypothesis test situation?
If you are only interested in the "Shopping" column, try creating a .xlsx or .csv file containing the data from only the "Shopping" column.
This way you can import the data and use pandas to perform the same t-test on each column individually.
import pandas as pd
from scipy import stats

# Note the corrected .xlsx extension.
data = pd.read_excel("datafile.xlsx")
one_sample_data = data["Shopping"]
# Compare the 11 months of Shopping data against January's value of 106.
one_sample = stats.ttest_1samp(one_sample_data, 106)
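If the data is already in memory, the same test can be run directly on the values from the question, without a file; a minimal sketch:
from scipy import stats

# The 11 previous months of Shopping data, compared against January's 106.
shopping = [161, 197, 190, 154, 179, 170, 183, 172, 176, 199, 156]
t_stat, p_value = stats.ttest_1samp(shopping, 106)
print("t = %.3f, p = %.3f" % (t_stat, p_value))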