Drop groups whose variance is zero - python

Suppose the following df:
import numpy as np
import pandas as pd

d={'month': ['01/01/2020', '01/02/2020', '01/03/2020', '01/01/2020', '01/02/2020', '01/03/2020'],
'country': ['Japan', 'Japan', 'Japan', 'Poland', 'Poland', 'Poland'],
'level':['A01', 'A01', 'A01', 'A00','A00', 'A00'],
'job title':['Insights Manager', 'Insights Manager', 'Insights Manager', 'Sales Director', 'Sales Director', 'Sales Director'],
'number':[0, 0.001, 0, 0, 0, np.nan],
'age':[24, 22, 45, np.nan, 60, 32]}
df=pd.DataFrame(d)
The idea is to get the variance of a specific column by group (in this case by country, level and job title), then select the segments whose variance is below a certain threshold and drop them from the original df.
However, when applied:
# define variance threshold
threshold = 0.0000000001
# get the variance by group for specific column
group_vars=df.groupby(['country', 'level', 'job title']).var()['number']
# select the rows to drop
rows_to_drop = df[group_vars<threshold].index
# drop the rows in place
#df.drop(rows_to_drop, axis=0, inplace=True)
The following error arises:
ValueError: Buffer dtype mismatch, expected 'Python object' but got 'long long'
The expected dataframe would drop Poland / A00 / Sales Director (variance 0.000000e+00) for all months, as it is a zero-variance segment.
Is it possible to reindex group_vars in order to drop it from original df?
What am I missing?

You can achieve this with transform:
# define variance threshold
threshold = 0.0000000001
# get the variance by group for specific column
group_vars=df.groupby(['country', 'level', 'job title'])['number'].transform('var')
# select the rows to drop
rows_to_drop = df[group_vars<threshold].index
# drop the rows in place
df.drop(rows_to_drop, axis=0, inplace=True)
Which gives:
month country level job title number age
0 01/01/2020 Japan A01 Insights Manager 0.000 24.0
1 01/02/2020 Japan A01 Insights Manager 0.001 22.0
2 01/03/2020 Japan A01 Insights Manager 0.000 45.0
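An alternative to collecting row indices is groupby().filter(), which keeps only the groups whose variance clears the threshold. A minimal sketch, assuming the same df and threshold as above:
# keep only the groups whose 'number' variance is at or above the threshold
df_filtered = df.groupby(['country', 'level', 'job title']).filter(
    lambda g: g['number'].var() >= threshold
)
This avoids the index alignment problem entirely, since filter works group by group on the original rows.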

Speeding up for-loops using pandas for feature engineering

I have a dataframe with the following headings:
payer
recipient_country
date of payment
Each row shows a transaction, and a row like (Bob, UK, 1st January 2023) shows that the payer Bob sent a payment to the UK on 1st January 2023.
For each row in this table I need to find the number of times that the payer for that row has sent a payment to the country for that row in the past. So for the row above I would want to find the number of times that Bob has sent money to the UK prior to 1st January 2023.
This is for feature engineering purposes.
I have done this using a for loop in which I iterate through rows and do a pandas loc call for each row to find rows with an earlier date with the same payer and country, but this is far too slow for the number of rows I have to process.
Can anyone think of a way to speed up this process using some fast pandas functions?
Thanks!
Testing on this toy data frame:
import pandas as pd
from pandas import Timestamp

df = pd.DataFrame(
    [{'name': 'Bob', 'country': 'UK', 'date': Timestamp('2023-01-01 00:00:00')},
     {'name': 'Bob', 'country': 'UK', 'date': Timestamp('2023-01-02 00:00:00')},
     {'name': 'Bob', 'country': 'UK', 'date': Timestamp('2023-01-03 00:00:00')},
     {'name': 'Cob', 'country': 'UK', 'date': Timestamp('2023-01-04 00:00:00')},
     {'name': 'Cob', 'country': 'UK', 'date': Timestamp('2023-01-05 00:00:00')},
     {'name': 'Cob', 'country': 'UK', 'date': Timestamp('2023-01-06 00:00:00')},
     {'name': 'Cob', 'country': 'UK', 'date': Timestamp('2023-01-07 00:00:00')}]
)
Just group by and cumulatively count:
>>> df['trns_bf'] = df.sort_values(by='date').groupby(['name', 'country'])['name'].cumcount()
name country date trns_bf
0 Bob UK 2023-01-01 0
1 Bob UK 2023-01-02 1
2 Bob UK 2023-01-03 2
3 Cob UK 2023-01-04 0
4 Cob UK 2023-01-05 1
5 Cob UK 2023-01-06 2
6 Cob UK 2023-01-07 3
You need to sort first, to ensure that earlier transactions are not confused with later ones. I interpreted "prior" in your question literally: e.g. there are no transactions before Bob's transaction to the UK on 1 Jan 2023, so its count is 0.
Each row gets its own count of transactions with that name to that country before that date. If there can be multiple transactions on one day, decide how you want to handle that. I would probably use another group by and take the maximum value for that day: df.groupby(['name', 'country', 'date'], as_index=False)['trns_bf'].max(), and then merge the result back (indexing makes it difficult to attach directly as above); see the sketch below.
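A minimal sketch of that tie-handling step, assuming the trns_bf column computed above:
# one count per (name, country, date): the maximum seen on that day
daily = df.groupby(['name', 'country', 'date'], as_index=False)['trns_bf'].max()
# replace the row-level count with the day-level count
df = df.drop(columns='trns_bf').merge(daily, on=['name', 'country', 'date'], how='left')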

How to drop pandas df columns whose variance is in a tolerance range?

Suppose the following df:
import pandas as pd

d={'month': ['01/01/2020', '01/02/2020', '01/03/2020', '01/01/2020', '01/02/2020', '01/03/2020'],
'country': ['Japan', 'Japan', 'Japan', 'Poland', 'Poland', 'Poland'],
'level':['A01', 'A01', 'A01', 'A00','A00', 'A00'],
'job title':['Insights Manager', 'Insights Manager', 'Insights Manager', 'Sales Director', 'Sales Director', 'Sales Director'],
'number':[0, 0.001, 0, 0, 0, 0],
'age':[24, 22, 45, 38, 60, 32]}
df=pd.DataFrame(d)
When trying to get the variance for all columns, the following result is obtained:
df.agg("var")
Result:
number 1.666667e-07
age 2.025667e+02
dtype: float64
The idea is to remove columns whose variance falls within a certain range; say, drop a column if its variance
is between 0 and 0.0001 (i.e. drop the number column, as its variance is within this range).
When I tried this:
df= df.loc[:, 0 < df.std() < .0001]
The following error arises:
ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().
Is it possible to drop pandas df columns whose variance is within a tolerance range?
Another solution (using .between + .drop(columns=...))
var = df.agg("var", numeric_only=True)
df = df.drop(columns=var[var.between(0, 0.0001)].index)
print(df)
Prints:
month country level job title age
0 01/01/2020 Japan A01 Insights Manager 24
1 01/02/2020 Japan A01 Insights Manager 22
2 01/03/2020 Japan A01 Insights Manager 45
3 01/01/2020 Poland A00 Sales Director 38
4 01/02/2020 Poland A00 Sales Director 60
5 01/03/2020 Poland A00 Sales Director 32
You can't use chained comparison operators with a pandas Series: under the hood they are translated to the boolean and operator, which only works with scalars. Use the vectorized & instead:
df[(((v := df.var(numeric_only=True)) > 0) & (v < 0.0001))[lambda x: x].index]
number
0 0.000
1 0.001
2 0.000
3 0.000
4 0.000
5 0.000
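If the goal is to drop those columns rather than select them, the same mask can be inverted; a short sketch using the same (0, 0.0001) range:
# variance per numeric column
v = df.var(numeric_only=True)
# drop every column whose variance falls strictly inside the range
low_var_cols = v[(v > 0) & (v < 0.0001)].index
df = df.drop(columns=low_var_cols)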

How to sort a table by mean after using groupby function?

I have a table that was created with a groupby function, and I want to sort it from highest mean to lowest mean. However, I keep getting the error message: "'DataFrameGroupBy' object has no attribute 'sort_values'" or sometimes "bool object is not callable".
import pandas as pd
import numpy as np
df = pd.read_csv("Listings.csv")
df2 = df[df['city'].str.contains("Cape Town")]
df2_by_neighbourhood = df2.groupby('neighbourhood')
df2_by_neighbourhood.describe()
df2_by_neighbourhood.sort_values(['mean'], ascending=False)
When I get rid of the last line, the table comes out perfectly, but the mean isn't from highest to lowest. It gives me this:
               price
               count  mean   std    min    25%    50%   75%    max
neighbourhood
Ward 1           207  1181  1422  210.0  524.0  750.0  1145  10000
(etc., 93 rows in total)
The data BEFORE using groupby looks like this:
  neighbourhood       city  price
0      Ward 115  Cape Town    700
...
[19086 rows x 3 columns]
I can't sort the table before using the groupby function, because groupby is how I'm getting the mean values.
You can try:
df2_by_neighbourhood['price'].describe()['mean'].sort_values(ascending=False)
to get the mean column as a Series, in descending order.
If you want to show the whole summary table (instead of just the mean column) in descending order of the mean, you can use:
df2_by_neighbourhood['price'].describe().sort_values('mean', ascending=False)
Edit
As seen from your newly added sample data, you can take the mean of the price column as follows:
df2_by_neighbourhood['price'].mean().sort_values(ascending=False)
Taking your data sample and adding some rows, you can see the result as follows:
data = {'neighbourhood': ['Ward 115', 'Ward 115', 'Ward 226', 'Ward 226'],
        'city': ['Cape Town', 'Cape Town', 'Cape Town', 'Cape Town'],
        'price': [700, 900, 1000, 1200]}
df2 = pd.DataFrame(data)
print(df2)
neighbourhood city price
0 Ward 115 Cape Town 700
1 Ward 115 Cape Town 900
2 Ward 226 Cape Town 1000
3 Ward 226 Cape Town 1200
Run the codes:
df2_by_neighbourhood = df2.groupby('neighbourhood')
df2_by_neighbourhood['price'].mean().sort_values(ascending=False)
Output:
neighbourhood
Ward 226 1100
Ward 115 800
Name: price, dtype: int64
Here, the mean values of price for 2 groups are sorted in descending order.
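If you want other summary statistics alongside the sorted means, a small sketch using agg on the same grouping (reusing the toy df2 above):
# aggregate several statistics at once, then sort the resulting frame by its 'mean' column
summary = (df2.groupby('neighbourhood')['price']
              .agg(['count', 'mean', 'std'])
              .sort_values('mean', ascending=False))
print(summary)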

Tiering pandas column based on unique id and range cutoffs

I have one df that categorizes income into tiers across males and females and thousands of zip codes. I need to add a column to df2 that maps each person's income level by zip code (average, above average etc.).
The idea is to assign the highest cutoff exceeded by a given person's income, or to assign the lowest tier by default.
The income level for each tier also varies by zip code. For certain zip codes there is a limited number of tiers (e.g. no very high incomes). There are also separate tiers for males by zip code, not shown due to space.
I think I need to create some sort of dictionary, but I'm not sure how to handle this. Any help would go a long way, thanks.
Edit: The first df acts as a key, and I am looking to use it to assign the corresponding row value from the column 'Income Level' to df2.
E.g. for a unique id in df2, compare df2['Annual Income'] to the matching id in df['Annual Income cutoff']. Then assign the highest possible Income level from df as a new row value in df2.
import pandas as pd
import numpy as np
data = [['female',10009,'very high',10000000],['female',10009,'high',100000],['female',10009,'above average',75000],['female', 10009, 'average', 50000]]
df = pd.DataFrame(data, columns = ['Sex', 'Area Code', 'Income level', 'Annual Income cutoff'])
print(df)
Sex Area Code Income level Annual Income cutoff
0 female 10009 very high 10000000
1 female 10009 high 100000
2 female 10009 above average 75000
3 female 10009 average 50000
data_2 = [['female',10009, 98000], ['female', 10009, 56000]]
df2 = pd.DataFrame(data_2, columns = ['Sex', 'Area Code', 'Annual Income'])
print(df2)
Sex Area Code Annual Income
0 female 10009 98000
1 female 10009 56000
output_data = [['female',10009, 98000, 'above average'], ['female', 10009, 56000, 'average']]
final_output = pd.DataFrame(output_data, columns = ['Sex', 'Area Code', 'Annual Income', 'Income Level'])
print(final_output)
Sex Area Code Annual Income Income Level
0 female 10009 98000 above average
1 female 10009 56000 average
One way to do this is to use pd.merge_asof:
pd.merge_asof(df2.sort_values('Annual Income'),
              df.sort_values('Annual Income cutoff'),
              left_on='Annual Income',
              right_on='Annual Income cutoff',
              by=['Sex', 'Area Code'], direction='backward')
Output:
Sex Area Code Annual Income Income level Annual Income cutoff
0 female 10009 56000 average 50000
1 female 10009 98000 above average 75000
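One caveat with direction='backward': an income below the lowest cutoff gets no match (NaN) rather than the lowest tier, so it may need to be filled afterwards. A small follow-up sketch that also trims the frame to the expected final_output columns (the 'average' default here is an assumption based on the sample key):
result = pd.merge_asof(df2.sort_values('Annual Income'),
                       df.sort_values('Annual Income cutoff'),
                       left_on='Annual Income',
                       right_on='Annual Income cutoff',
                       by=['Sex', 'Area Code'], direction='backward')
# incomes below the lowest cutoff get NaN; assign them the lowest tier by default
result['Income level'] = result['Income level'].fillna('average')
# drop the helper column and match the expected column name
final = (result.drop(columns='Annual Income cutoff')
               .rename(columns={'Income level': 'Income Level'}))
print(final)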

How to 'Scale Data' in Pandas or any other Python Libraries

I'm analyzing a company data set that stores 'Company Name' and 'Company Profit'. I also have another data set that has '# of Employees' and 'Feedback (Negative or Positive)'. I want to analyze whether companies with higher profits have more positive employees or not. The issue is that 'Company Profit' is in the millions or billions, while the number of employees is quite small.
So, can I scale the data, or should I do something else here?
Suggestions are welcome.
If you have a table that looks like this:
Company Name Company Profit # of Employees Feedback (Negative or Positive)
0 Alpha 1000000 10 Positive
1 Bravo 13000000 210 Positive
2 Charlie 2300000 16 Negative
3 Delta 130000 1 Negative
and want a table that looks like this:
Company Name Company Profit (Million) # of Employees Feedback (Negative or Positive)
0 Alpha 1.00 10 Positive
1 Bravo 13.00 210 Positive
2 Charlie 2.30 16 Negative
3 Delta 0.13 1 Negative
Then you can use the apply method and a lambda function to rescale the data.
# this part creates the original table
import pandas as pd
columns = ['Company Name', 'Company Profit', '# of Employees', 'Feedback (Negative or Positive)']
df = pd.DataFrame([('Alpha', 1000000, 10, 'Positive'),
                   ('Bravo', 13000000, 210, 'Positive'),
                   ('Charlie', 2300000, 16, 'Negative'),
                   ('Delta', 130000, 1, 'Negative')], columns=columns)
# this part makes the modification
df['Company Profit (Million)'] = df['Company Profit'].apply(lambda x: x / 1000000)
df = df[['Company Name', 'Company Profit (Million)', '# of Employees', 'Feedback (Negative or Positive)']]
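If the goal is to make profit and head count directly comparable for analysis, rather than just to change the units, a common option is z-score standardization; a minimal sketch with plain pandas, reusing the df above:
numeric_cols = ['Company Profit', '# of Employees']
for col in numeric_cols:
    # z-score: subtract the column mean, divide by the column standard deviation
    df[col + ' (scaled)'] = (df[col] - df[col].mean()) / df[col].std()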
