Computing MAD (mean absolute deviation) with GroupBy in pandas - python

I have a dataframe:
Type Name Cost
A X 545
B Y 789
C Z 477
D X 640
C X 435
B Z 335
A X 850
B Y 152
I have all such combinations in my dataframe, with Type in ['A','B','C','D'] and Name in ['X','Y','Z']. I used the groupby method to get stats on a specific combination together, like A-X, A-Y, A-Z. Here's some code:
df = pd.DataFrame({'Type':['A','B','C','D','C','B','A','B'] ,'Name':['X','Y','Z','X','X','Z','X','Y'], 'Cost':[545,789,477,640,435,335,850,152]})
df.groupby(['Name','Type']).agg(['mean', 'std'])
# need to use mad instead of std
I need to eliminate the observations that are more than 3 MADs away from the mean; something like:
test = df[np.abs(df.Cost-df.Cost.mean())<=(3*df.Cost.mad())]
I am confused here because df.Cost.mad() returns the MAD of Cost over the entire dataframe rather than for a specific Type-Name category. How can I combine both?

You can use groupby and transform to create new data series that can be used to filter out your data.
groups = df.groupby(['Name','Type'])
mad = groups['Cost'].transform(lambda x: x.mad())
dif = groups['Cost'].transform(lambda x: np.abs(x - x.mean()))
df2 = df[dif <= 3*mad]
However, in this case, no row is filtered out since the difference is equal to the mean absolute deviation (the groups have only two rows at most).
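Note that Series.mad() was deprecated in pandas 1.5 and removed in 2.0, so on current versions the transform can compute the mean absolute deviation explicitly. A minimal sketch, reusing the groups object from above:
# same as x.mad() on older pandas: mean absolute distance from the group mean
mad = groups['Cost'].transform(lambda x: (x - x.mean()).abs().mean())
dif = groups['Cost'].transform(lambda x: (x - x.mean()).abs())
df2 = df[dif <= 3 * mad]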

You can apply your aggregate function to the grouped object with transform:
df["mad"] = df.groupby(['Name','Type'])["Cost"].transform("mad")
df = df.loc[df["mad"] < 3]
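The same caveat applies to transform("mad"): it only works on pandas versions that still ship the mad aggregation (pre-2.0). Also note that comparing the MAD itself to 3 is not quite the 3-MADs-from-the-mean rule asked about; a sketch of that rule (a variation, not part of the original answer), with the deviation computed explicitly:
grouped = df.groupby(['Name','Type'])['Cost']
df["mad"] = grouped.transform(lambda x: (x - x.mean()).abs().mean())
df["dev"] = (df["Cost"] - grouped.transform("mean")).abs()
df = df.loc[df["dev"] <= 3 * df["mad"]]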

Related

How to perform the computation of a new column for a dataframe with the calculation provided as a string (and avoiding eval())

I have a pandas dataframe like this:
df = pd.DataFrame({'1': [10, 20, 30], '2': [100, 200, 300]})
# 1 2
#0 10 100
#1 20 200
#2 30 300
The goal is to calculate a new column. However, the calculation is provided as a string:
calc = '{1}+{2}'
How can I calculate a new column based on the existing columns and the provided calculation?
What I tried:
My initial idea was to use apply on the dataframe and lambda to make the calculation. Before that I would adjust the calculation string accordingly. However, that would make the use of eval necessary:
for i in range(10):
    calc = calc.replace('{'+str(i)+'}', 'row["'+str(i)+'"]')
# outputs calc = 'row["1"]+row["2"]'
df['new_col'] = df.apply(lambda row: eval(calc), axis=1)
# basically: df.apply(lambda row: eval('row["1"]+row["2"]'), axis=1)
Since I want to avoid eval, I am looking for a different solution.
You could use pandas' eval method, but you would need to remove the curly brackets and you cannot have numerical column names.
One option would be to adapt the string to add a prefix (e.g. col) using a regex:
calc = '{1}+{2}'
import re
query = re.sub('{([^}]+)}', r'col\1', calc)
# col1+col2
df['new_col'] = df.add_prefix('col').eval(query)
output:
1 2 new_col
0 10 100 110
1 20 200 220
2 30 300 330
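The same pattern should work for any operator that pandas.eval understands; for instance, with a hypothetical multiplication string (re and df as defined above):
calc = '{1}*{2}'
query = re.sub('{([^}]+)}', r'col\1', calc)  # col1*col2
df['product'] = df.add_prefix('col').eval(query)
# 0    1000
# 1    4000
# 2    9000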

How do I divide a dataframe based on the content in the rows of a column?

I'm trying to get two dataframes out of one. The dataframe has two sets of words (neutral and non-neutral), so I need to divide it into a dataframe that only has neutral words and another that only has non-neutral words (maintaining all the rows and columns). These words are in a column called PALABRA.
This is an example of the words in a variable (they are a lot more than these):
neutral_words = ('CAR','CLOUD','SUN')
nonneutral_words = ('ACCIDENT','BUG','BURN')
The df looks like this:
PRESSEDKEY PALABRA COLOR KEYCORR RT CORRECT
90 v BURN red r 496 N
96 v SUN red r 1307 N
102 v BUG red r 0 N
108 v CLOUD blue a 168 N
114 v ACCIDENT green v 73 Y
This way, I need to divide the dataframe into df1 with neutral_words only and df2 with nonneutral_words. How can I do this?
You can use isin:
df1 = df.loc[df['PALABRA'].isin(neutral_words)]
df2 = df.loc[df['PALABRA'].isin(nonneutral_words)]
I think you'll want to use the isin function.
Something like:
df2 = df[df.PALABRA.isin(['ACCIDENT','BUG','BURN'])]
or
df2 = df[df.PALABRA.isin(nonneutral_words)]
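If every word belongs to exactly one of the two lists, df2 can also be built as the complement of the neutral mask, which avoids keeping the two tuples in sync; a small variation on the same isin idea:
neutral_mask = df['PALABRA'].isin(neutral_words)
df1 = df.loc[neutral_mask]
df2 = df.loc[~neutral_mask]  # everything that is not a neutral word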

How to Ignore a few rows with a unique index in a pandas data frame while using groupby()?

I have a data frame df:
ID Height
A 168
A 170
A 190
A 159
B 172
B 173
C 185
I am trying to eliminate outliers in df from each ID separately using:
outliersfree = df[df.groupby("ID")['Height'].transform(lambda x : x < (x.quantile(0.95) + 5*(x.quantile(0.95) - x.quantile(0.05)))).eq(1)]
Here, I want to ignore the rows with a unique index, i.e., all the IDs that have only one corresponding entry. For instance, in the df given, the ID C has only one entry. Hence, I want to ignore C while eliminating outliers and keep it as it is in the new data frame formed, outliersfree.
I am also interested in knowing how to ignore/skip IDs which have two entries (For example, B in the df).
One option is to create an OR condition in your lambda function such that if there is one element in your group, you return True.
df.groupby("ID")['Height'].transform(lambda x : (x.count() == 1) |
(x < (x.quantile(0.95) + 5*
(x.quantile(0.95) - x.quantile(0.05)))))
And you can use (x.count() < 3) for groups with two entries or fewer.
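Putting it together, the boolean result of the transform can be used directly as a row mask; one way to apply it, assuming the df from the question:
mask = df.groupby("ID")['Height'].transform(
    lambda x: (x.count() == 1) |
              (x < (x.quantile(0.95) + 5 * (x.quantile(0.95) - x.quantile(0.05))))
)
outliersfree = df[mask]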

Create a new column based on calculations that change between rows?

I would like to calculate a sum of variables for a given day. Each day contains a different calculation, but all the days use the variables consistently.
There is a df which specifies my variables and a df which specifies how calculations will change depending on the day.
How can I create a new column containing answers from these different equations?
import pandas as pd
import numpy as np
conversion = [["a",5],["b",1],["c",10]]
conversion_table = pd.DataFrame(conversion,columns=['Variable','Cost'])
data1 = [[1,"3a+b"],[2,"c"],[3,"2c"]]
to_solve = pd.DataFrame(data1,columns=['Day','Q1'])
desired = [[1,16],[2,10],[3,20]]
desired_table=pd.DataFrame(desired,columns=['Day','Q1 solved'])
I have separated my variables and equations based on row. Can I loop through these equations to find non-numerics and re-assign them?
# separate out equations and values
for var in conversion_table["Variable"]:
    cost = (conversion_table.loc[conversion_table['Variable'] == var, 'Cost']).mean()
for row in to_solve["Q1"]:
    equation = row
A simple suggestion: perhaps you need to rewrite part of your code. Not sure if you want something like this:
a = 5
b = 1
c = 10
# Rewrite the equation that is readable by Python
# e.g. replace 3a+b by 3*a+b
data1 = [[1,"3*a+b"],
[2,"c"],
[3,"2*c"]]
desired_table = pd.DataFrame(data1,
columns=['Day','Q1'])
desired_table['Q1 solved'] = desired_table['Q1'].apply(lambda x: eval(x))
desired_table
Output:
Day Q1 Q1 solved
0 1 3*a+b 16
1 2 c 10
2 3 2*c 20
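Rather than hardcoding a, b and c, the values could also be pulled from conversion_table and handed to eval as local variables; a small sketch along the same lines (still assuming the rewritten 3*a+b style strings):
# map each variable name to its cost, e.g. {'a': 5, 'b': 1, 'c': 10}
values = dict(zip(conversion_table['Variable'], conversion_table['Cost']))
desired_table['Q1 solved'] = desired_table['Q1'].apply(lambda expr: eval(expr, {}, values))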
If it's possible to change the equations so that they use *, then you could do this.
Get the mapping from the conversion table:
mapping = dict(zip(conversion_table['Variable'], conversion_table['Cost']))
then eval each expression, replacing the variables with their numeric values from the mapping:
desired_table['Q1 solved'] = to_solve['Q1'].map(lambda x: eval(''.join([str(mapping[i]) if i.isalpha() else str(i) for i in x])))
0 16
1 10
2 20
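If rewriting the input strings is not an option, one possible workaround (an assumption on my part, not taken from the answers above) is to insert the implicit * between a digit and a letter with a regex before substituting and evaluating:
import re

mapping = dict(zip(conversion_table['Variable'], conversion_table['Cost']))

def solve(expr):
    # "3a+b" -> "3*a+b" -> "3*5+1" -> 16
    expr = re.sub(r'(\d)([a-zA-Z])', r'\1*\2', expr)
    expr = ''.join(str(mapping[ch]) if ch.isalpha() else ch for ch in expr)
    return eval(expr)

to_solve['Q1 solved'] = to_solve['Q1'].map(solve)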

Pandas: Filter dataframe for values that are too frequent or too rare

On a pandas dataframe, I know I can groupby on one or more columns and then filter values that occur more/less often than a given number of times.
But I want to do this on every column of the dataframe. I want to remove values that are too infrequent (let's say that occur less than 5% of the time) or too frequent. As an example, consider a dataframe with the following columns: city of origin, city of destination, distance, type of transport (air/car/foot), time of day, price-interval.
import pandas as pd
import string
import numpy as np
vals = [(c, np.random.choice(list(string.ascii_lowercase), 100, replace=True)) for c in
        ('city of origin', 'city of destination', 'distance, type of transport (air/car/foot)',
         'time of day, price-interval')]
df = pd.DataFrame(dict(vals))
>>> df.head()
city of destination city of origin distance, type of transport (air/car/foot) time of day, price-interval
0 f p a n
1 k b a f
2 q s n j
3 h c g u
4 w d m h
If this is a big dataframe, it makes sense to remove rows that have spurious items, for example, if time of day = night occurs only 3% of the time, or if foot mode of transport is rare, and so on.
I want to remove all such values from all columns (or a list of columns). One idea I have is to do a value_counts on every column, transform and add one column for each value_counts; then filter based on whether they are above or below a threshold. But I think there must be a better way to achieve this?
This procedure will go through each column of the DataFrame and eliminate rows where the given category occurs less often than a given threshold percentage, shrinking the DataFrame on each loop.
This answer is similar to that provided by @Ami Tavory, but with a few subtle differences:
It normalizes the value counts so you can just use a percentile threshold.
It calculates counts just once per column instead of twice. This results in faster execution.
Code:
threshold = 0.03
for col in df:
    counts = df[col].value_counts(normalize=True)
    df = df.loc[df[col].isin(counts[counts > threshold].index), :]
Code timing:
df2 = pd.DataFrame(np.random.choice(list(string.ascii_lowercase), (1000000, 4), replace=True),
                   columns=list('ABCD'))
%%timeit df=df2.copy()
threshold = 0.03
for col in df:
    counts = df[col].value_counts(normalize=True)
    df = df.loc[df[col].isin(counts[counts > threshold].index), :]
1 loops, best of 3: 485 ms per loop
%%timeit df=df2.copy()
m = 0.03 * len(df)
for c in df:
    df = df[df[c].isin(df[c].value_counts()[df[c].value_counts() > m].index)]
1 loops, best of 3: 688 ms per loop
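If only some columns should be considered (the question mentions "a list of columns"), the same loop can simply run over that list instead of the whole frame; a small variation with an arbitrarily chosen subset:
threshold = 0.03
cols_to_check = ['city of origin', 'city of destination']  # any subset of df.columns

for col in cols_to_check:
    counts = df[col].value_counts(normalize=True)
    df = df.loc[df[col].isin(counts[counts > threshold].index), :]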
I would go with one of the following:
Option A
m = 0.03 * len(df)
df[np.all(
    df.apply(
        lambda c: c.isin(c.value_counts()[c.value_counts() > m].index).to_numpy()),
    axis=1)]
Explanation:
m = 0.03 * len(df) is the threshold (it's nice to take the constant out of the complicated expression)
df[np.all(..., axis=1)] retains the rows where the condition holds across all columns.
df.apply(...) applies a function to all columns; .to_numpy() (the modern replacement for .as_matrix()) turns each boolean Series into a plain array.
c.isin(...) checks, for each column item, whether it is in some set.
c.value_counts()[c.value_counts() > m].index is the set of all values in a column whose count is above m.
Option B
m = 0.03 * len(df)
for c in df.columns:
    df = df[df[c].isin(df[c].value_counts()[df[c].value_counts() > m].index)]
The explanation is similar to the one above.
Tradeoffs:
Personally, I find B more readable.
B creates a new DataFrame for each filtering of a column; for large DataFrames, it's probably more expensive.
I am new to Python and using Pandas. I came up with the following solution below. Maybe other people might have a better or more efficient approach.
Assuming your DataFrame is DF, you can use the code below to filter out all infrequent values. Just be sure to update the col and bin_freq variables. DF_Filtered is your new filtered DataFrame.
# Column you want to filter
col = 'time of day'
# Set your frequency to filter out. Currently set to 5%
bin_freq = float(5)/float(100)
DF_Filtered = pd.DataFrame()
for i in DF[col].unique():
    counts = DF[DF[col]==i].count()[col]
    total_counts = DF[col].count()
    freq = float(counts)/float(total_counts)
    if freq > bin_freq:
        DF_Filtered = pd.concat([DF[DF[col]==i], DF_Filtered])
print(DF_Filtered)
DataFrames support clip(lower=threshold) and clip(upper=threshold) (formerly the clip_lower and clip_upper methods), which cap, rather than remove, all values below or above (respectively) a certain threshold.
We can also replace all the rare categories with one label, say "Rare", and remove them later if they don't add value to prediction.
# function that finds the labels appearing in more than a certain percentage/threshold of rows
def get_freq_labels(df, var, rare_perc):
    df = df.copy()
    tmp = df.groupby(var)[var].count() / len(df)
    return tmp[tmp > rare_perc].index
vars_cat = [val for val in data.columns if data[val].dtype == 'O']
for var in vars_cat:
    # find the frequent categories
    frequent_cat = get_freq_labels(data, var, 0.05)
    # replace rare categories by the string "Rare"
    data[var] = np.where(data[var].isin(frequent_cat), data[var], 'Rare')
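For illustration, a tiny made-up frame run through the same helper (the data here is hypothetical):
import numpy as np
import pandas as pd

# 'dog' and 'cat' are frequent, 'emu' appears only once (5% of rows)
data = pd.DataFrame({'animal': ['dog'] * 10 + ['cat'] * 9 + ['emu']})

frequent_cat = get_freq_labels(data, 'animal', 0.05)  # Index(['cat', 'dog'], ...)
data['animal'] = np.where(data['animal'].isin(frequent_cat), data['animal'], 'Rare')
print(data['animal'].value_counts())  # dog 10, cat 9, Rare 1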
