I am using pandas to create pivot tables. My data looks usually contains a lot of numeric values which can easily be aggregated with np.mean (e.g. question1), but there is one exception - Net Promoter Score (notice Total 0.00 both for EU and NA)
responseId country region nps question1
0 1 Germany EU 11 3.2
1 2 Germany EU 10 5.0
2 3 US NA 7 4.3
3 4 US NA 5 4.8
4 5 France EU 5 3.2
5 6 France EU 5 5.0
6 7 France EU 11 5.0
region EU NA
country France Germany Total US Total
nps -33.33 100.0 0.00 -100.00 0.00
question1 4.40 4.1 4.25 4.55 4.55
For NPS I use a custom aggfunc
def calculate_nps(column):
detractors = [1,2,3,4,5,6,7]
passives = [8,9]
promoters = [10,11]
counts = column.value_counts(normalize=True)
percent_promoters = counts.reindex(promoters).sum()
percent_detractors = counts.reindex(detractors).sum()
return (percent_promoters - percent_detractors) * 100
aggfunc = {
"nps": calculate_nps,
"question1": np.mean
}
pd.pivot_table(data=df,columns=["region","country"],values=["nps","question1"],aggfunc=aggfunc,margins=True,margins_name="Total",sort=True)
This aggfunc works fine for regular columns, but fails for margins ("Total" columns), because pandas passes data already aggregated. For regular fields calculate_nps receives columns like this
4 5
5 5
6 11
Name: nps, dtype: int64
but for margins the data looks like this
region country
EU France -33.333333
Germany 100.000000
Name: nps, dtype: float64
calculate_nps cannot deal with such data and returns 0. In this case column.mean() should be applied which I solved like this (notice if column.index.names != [None])
def calculate_nps(column):
if column.index.names != [None]:
return column.mean()
detractors = [1,2,3,4,5,6,7]
passives = [8,9]
promoters = [10,11]
counts = column.value_counts(normalize=True)
percent_promoters = counts.reindex(promoters).sum()
percent_detractors = counts.reindex(detractors).sum()
return (percent_promoters - percent_detractors) * 100
Now the pivot table is correct
region EU NA
country France Germany Total US Total
nps -33.33 100.0 33.33 -100.00 -100.00
question1 4.40 4.1 4.25 4.55 4.55
Question
Is there a proper / better way to determine what kind of data is passed to aggfunc? I'm not sure that my solution will work for all scenarios
Related
I have a dataframe (3.7 million rows) with a column with different country names
id Country
1 RUSSIA
2 USA
3 RUSSIA
4 RUSSIA
5 INDIA
6 USA
7 USA
8 ITALY
9 USA
10 RUSSIA
I want to replace INDIA and ITALY with "Miscellanous" because they occur less than 15% in the column
My alternate solution is to replace the names with there frequency using
df.column_name = df.column_name.map(df.column_name.value_counts())
Use:
df.loc[df.groupby('Country')['id']
.transform('size')
.div(len(df))
.lt(0.15),
'Country'] = 'Miscellanous'
Or
df.loc[df['Country'].map(df['Country'].value_counts(normalize=True)
.lt(0.15)),
'Country'] = 'Miscellanous'
If you want to put all country whose frequency is less than a threshold into the "Misc" category:
threshold = 0.15
freq = df['Country'].value_counts(normalize=True)
mappings = freq.index.to_series().mask(freq < threshold, 'Misc').to_dict()
df['Country'].map(mappings)
Here is another option
s = df.value_counts()
s = s/s.sum()
s = s.loc[s<.15].reset_index()
df = df.replace(s['Place'].tolist(),'Miscellanous')
You can use dictionary and map for this:
d = df.Country.value_counts(normalize=True).to_dict()
df.Country.map(lambda x : x if d[x] > 0.15 else 'Miscellanous' )
Output:
id
1 RUSSIA
2 USA
3 RUSSIA
4 RUSSIA
5 Miscellanous
6 USA
7 USA
8 Miscellanous
9 USA
10 RUSSIA
Name: Country, dtype: object
i have a 2 dataframes as given below,
import pandas as pd
restaurant = pd.read_excel("C:/Users/Avinash/Desktop/restaurant data.xlsx")
restaurant
Restaurant StartYear Capex inflation_adjusted_capex
Bawarchi Restaurant 1986 6000 Nan
Ks Baker's 1988 2000 Nan
Rajesh Restaurant 1989 1050 Nan
Ahmed Steak House 1990 9000 Nan
Absolute Barbique 1997 9500 Nan
inflation = pd.read_excel("C:/Users/Avinash/Desktop/restaurant data.xlsx", sheet_name="Sheet2")
inflation
Years Inflation_Factor
1985 0.111
1986 0.134
1987 0.191
1988 0.2253
1989 0.265
1990 0.304
Aim: is to fill "inflation_adjusted_capex" with div of "Capex" by corresponding years "Inflation_Factor from second Dataframe.
The code i wrote is,
for i in restaurant["StartYear"]:
restaurant["inflation_adjusted_capex"] =
(restaurant["inflation_adjusted_capex"])/(inflation[inflation["Years"] == i]["Inflation_Factor"])
print(restaurant["inflation_adjusted_capex"])
0 Nan
1 Nan
2 Nan
3 Nan
4 Nan
Name: Inflation adjusted Capex to current year, dtype: float64
Unfortunately this code is returning Nan values, kindly help me. Thanks in advance.
There are a couple ways to do this. The first is to join the dataframes so that you have your inflation factors in the first dataframe, and then do the calculation:
#add inflation_factor column to first dataframe
restaurant = restaurant.merge(inflation, left_on = 'StartYear', right_on = 'Year')
#do dividsion
restaurant['inflation_adjusted_capex'] = restaurant['Capex']/restaurant['Inflation_Factor']
The other is to apply a function that behaves like an excel VLOOKUP:
#set year as index for inflation so we can look up based on it
inflation = inflation.set_index('Year')
#look up inflation factor and divide with a lambda function
restaurant['inflation_adjusted_capex'] = inflation.apply(lambda row: row['Capex']/inflation['Inflation_Factor'][row['StartYear']], 1)
I have data frame with missing values:
import pandas as pd
data = {'Brand':['residential','unclassified','tertiary','residential','unclassified','primary','residential'],
'Price': [22000,25000,27000,"NA","NA",10000,"NA"]
}
df = pd.DataFrame(data, columns = ['Brand', 'Price'])
print (df)
Resulting in this data frame:
Brand Price
0 residential 22000
1 unclassified 25000
2 tertiary 27000
3 residential NA
4 unclassified NA
5 primary 10000
6 residential NA
I would like to fill in the missing values for residential and unclassified in the prices column with fixed values (residential =1000, unclassified=2000), however I dont want to lose any values that are already present in the prices column for residential or unclassified, so the out put should look like this:
Brand Price
0 residential 22000
1 unclassified 25000
2 tertiary 27000
3 residential 1000
4 unclassified 2000
5 primary 10000
6 residential 1000
Whats the easiest way to get this done
We can do map with fillna , PS: you need to make sure in your df, NA is NaN
df.Price.fillna(df.Brand.map({'residential':1000,'unclassified':2000}),inplace=True)
df
Brand Price
0 residential 22000.0
1 unclassified 25000.0
2 tertiary 27000.0
3 residential 1000.0
4 unclassified 2000.0
5 primary 10000.0
6 residential 1000.0
I have a datafarme which looks like as follows (there are more columns having been dropped off):
memberID shipping_country
264991
264991 Canada
100 USA
5000
5000 UK
I'm trying to fill the blank cells with existing value of shipping country for each user:
memberID shipping_country
264991 Canada
264991 Canada
100 USA
5000 UK
5000 UK
However, I'm not sure what's the most efficient way to do this on a large scale dataset. Perhaps, using a vectored groupby method?
You can use GroupBy + ffill / bfill:
def filler(x):
return x.ffill().bfill()
res = df.groupby('memberID')['shipping_country'].apply(filler)
A custom function is necessary as there's no combined Pandas method to ffill and bfill sequentially.
This also caters for the situation where all values are NaN for a specific memberID; in this case they will remain NaN.
For the following sample dataframe (I added a memberID group that only contains '' in the shipping_country column):
memberID shipping_country
0 264991
1 264991 Canada
2 100 USA
3 5000
4 5000 UK
5 54
This should work for you, and also as the behavior that if a memberID group only has empty string values ('') in shipping_country, those will be retained in the output df:
df['shipping_country'] = df.replace('',np.nan).groupby('memberID')['shipping_country'].transform('first').fillna('')
Yields:
memberID shipping_country
0 264991 Canada
1 264991 Canada
2 100 USA
3 5000 UK
4 5000 UK
5 54
If you would like to leave the empty strings '' as NaN in the output df, then just remove the fillna(''), leaving:
df['shipping_country'] = df.replace('',np.nan).groupby('memberID')['shipping_country'].transform('first')
You can use chained groupbys, one with forward fill and one with backfill:
# replace blank values with `NaN` first:
df['shipping_country'].replace('',pd.np.nan,inplace=True)
df.iloc[::-1].groupby('memberID').ffill().groupby('memberID').bfill()
memberID shipping_country
0 264991 Canada
1 264991 Canada
2 100 USA
3 5000 UK
4 5000 UK
This method will also allow a group made up of all NaN to remain NaN:
>>> df
memberID shipping_country
0 264991
1 264991 Canada
2 100 USA
3 5000
4 5000 UK
5 1
6 1
df['shipping_country'].replace('',pd.np.nan,inplace=True)
df.iloc[::-1].groupby('memberID').ffill().groupby('memberID').bfill()
memberID shipping_country
0 264991 Canada
1 264991 Canada
2 100 USA
3 5000 UK
4 5000 UK
5 1 NaN
6 1 NaN
Hi I am trying to assign certain values in columns of a dataframe.
# Count the number of title counts
full.groupby(['Sex', 'Title']).Title.count()
Sex Title
female Dona 1
Dr 1
Lady 1
Miss 260
Mlle 2
Mme 1
Mrs 197
Ms 2
the Countess 1
male Capt 1
Col 4
Don 1
Dr 7
Jonkheer 1
Major 2
Master 61
Mr 757
Rev 8
Sir 1
Name: Title, dtype: int64
My tail of dataframe looks like follows:
Age Cabin Embarked Fare Name Parch PassengerId Pclass Sex SibSp Survived Ticket Title
413 NaN NaN S 8.0500 Spector, Mr. Woolf 0 1305 3 male 0 NaN A.5. 3236 Mr
414 39.0 C105 C 108.9000 Oliva y Ocana, Dona. Fermina 0 1306 1 female 0 NaN PC 17758 Dona
415 38.5 NaN S 7.2500 Saether, Mr. Simon Sivertsen 0 1307 3 male 0 NaN SOTON/O.Q. 3101262 Mr
416 NaN NaN S 8.0500 Ware, Mr. Frederick 0 1308 3 male 0 NaN 359309 Mr
417 NaN NaN C 22.3583 Peter, Master. Michael J 1 1309 3 male 1 NaN 2668 Master
The name of my dataframe is full and I want to change names of Title.
Here is the following code I wrote :
# Create a variable rate_title to modify the names of Title
rare_title = ['Dona', "Lady", "the Countess", "Capt", "Col", "Don", "Dr", "Major", "Rev", "Sir", "Jonkheer"]
# Also reassign mlle, ms, and mme accordingly
full[full.Title == "Mlle"].Title = "Miss"
full[full.Title == "Ms"].Title = "Miss"
full[full.Title == "Mme"].Title = "Mrs"
full[full.Title.isin(rare_title)].Title = "Rare Title"
I also tried the following code in pandas:
full.loc[full['Title'] == "Mlle", ['Sex', 'Title']] = "Miss"
Still the dataframe is not changed. Any help is appreciated.
Use loc based indexing and set matching row values -
miss = ['Mlle', 'Ms', 'Mme']
rare_title = ['Dona', "Lady", ...]
df.loc[df.Title.isin(miss), 'Title'] = 'Miss'
df.loc[df.Title.isin(rare_title), 'Title'] = 'Rare Title'