How to 'Scale Data' in Pandas or any other Python Libraries

I'm analyzing a company data set that stores 'Company Name' and 'Company Profit'. I also have another data set with '# of Employees' and 'Feedback (Negative or Positive)'. I want to analyze whether companies with more profit have more positive employees or not. The issue is that 'Company Profit' is in the millions or billions, while the number of employees is quite small.
So, can I scale the data, or should I do something else here?
Suggestions are welcome.

If you have a table that looks like this:
  Company Name  Company Profit  # of Employees Feedback (Negative or Positive)
0        Alpha         1000000              10                        Positive
1        Bravo        13000000             210                        Positive
2      Charlie         2300000              16                        Negative
3        Delta          130000               1                        Negative
and want a table that looks like this:
  Company Name  Company Profit (Million)  # of Employees Feedback (Negative or Positive)
0        Alpha                      1.00              10                        Positive
1        Bravo                     13.00             210                        Positive
2      Charlie                      2.30              16                        Negative
3        Delta                      0.13               1                        Negative
Then you can use the apply method and a lambda function to rescale the data.
# this part creates the original table
import pandas as pd

columns = ['Company Name', 'Company Profit', '# of Employees', 'Feedback (Negative or Positive)']
df = pd.DataFrame([('Alpha', 1000000, 10, 'Positive'),
                   ('Bravo', 13000000, 210, 'Positive'),
                   ('Charlie', 2300000, 16, 'Negative'),
                   ('Delta', 130000, 1, 'Negative')], columns=columns)

# this part makes the modification
df['Company Profit (Million)'] = df['Company Profit'].apply(lambda x: x / 1000000)
df = df[['Company Name', 'Company Profit (Million)', '# of Employees', 'Feedback (Negative or Positive)']]
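As an aside, apply with a lambda isn't strictly necessary here; column arithmetic in pandas is vectorized, so a plain division gives the same result and is usually faster on large frames. A minimal alternative sketch using the df built above:

# vectorized alternative: divide the whole column in one step
df['Company Profit (Million)'] = df['Company Profit'] / 1000000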

Related

Columns selection on specific text

I want to extract the specific rows whose 'Names' entry contains certain text. Below you can see my data:
import numpy as np
import pandas as pd

data = {
    'Names': ['Store (007) Total amount of Sales ',
              'Store perc (65) Total amount of sales ',
              'Mall store, aid (005) Total amount of sales',
              'Increase in the value of sales / Additional seling (22) Total amount of sales',
              'Dividends (0233) Amount of income tax',
              'Other income (098) Total amount of Sales',
              'Other income (0245) Amount of Income Tax',
              ],
    'Sales': [10, 10, 9, 7, 5, 5, 5],
}
df = pd.DataFrame(data, columns=['Names', 'Sales'])
df
This data has some specific rows that I need selected into a separate data frame. The keywords for this selection are the words Total amount of Sales or Total amount of sales. These words are placed after the closing bracket ). Also please take into account that the text is not trimmed, so extra spaces are possible.
So can anybody help me solve this?
Use Series.str.contains with case=False, so the match is case-insensitive, in boolean indexing:
df1 = df[df['Names'].str.contains('Total amount of Sales', case=False)]
print (df1)
Names Sales
0 Store (007) Total amount of Sales 10
1 Store perc (65) Total amount of sales 10
2 Mall store, aid (005) Total amount of sales 9
3 Increase in the value of sales / Additional se... 7
5 Other income (098) Total amount of Sales 5
Or, if you only need to allow sales or Sales (rather than full case-insensitivity), use:
df2 = df[df['Names'].str.contains('Total amount of [Ss]ales')]
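If you also want to enforce the detail from the question that the keyword appears after the closing bracket ), one option is to anchor a case-insensitive regex to ) with optional whitespace in between. This is a hedged sketch; the df3 name is only for illustration:

# match 'Total amount of sales' (any case) only when it follows a closing bracket
df3 = df[df['Names'].str.contains(r'\)\s*Total amount of sales', case=False)]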

Create new rows in a Pandas Dataframe based on a column from another pandas dataframe

I have a dataframe DF1 which looks like this:
Account Name  Task Type  Flag   Cost
Account 1     Repair     True   $100
Account 2     Repair     True   $200
Account 3     Repair     False  $300
DF2 looks like this:
Country  Percentage
US       30%
Canada   20%
India    50%
I want to create DF3 based on DF1 & DF2 by doing the following:
Filter rows where Flag = True
Create a new column 'Calculated_Cost' that multiplies the 'Cost' column in DF1 by the Percentage column of DF2, creating multiple rows based on the number of rows in DF2
The Final output would look like this:
Account Name  Task Type  Flag   Cost  Country  Calculated_Cost
Account 1     Repair     True   $100  US       $30
Account 1     Repair     True   $100  Canada   $20
Account 1     Repair     True   $100  India    $50
Account 2     Repair     True   $200  US       $60
Account 2     Repair     True   $200  Canada   $40
Account 2     Repair     True   $200  India    $100
Account 3     Repair     False  $300  NaN      NaN
Use:
df1['Cost'] = df1['Cost'].str.lstrip('$').astype(int)
df2['Percentage'] = df2['Percentage'].str.rstrip('%').astype(int).div(100)
df = pd.concat([df1[df1['Flag']].merge(df2, how='cross'), df1[~df1['Flag']]])
df['Calculated_Cost'] = df['Cost'].mul(df.pop('Percentage'))
print (df)
Account Name Task Type Flag Cost Country Calculated_Cost
0 Account 1 Repair True 100 US 30.0
1 Account 1 Repair True 100 Canada 20.0
2 Account 1 Repair True 100 India 50.0
3 Account 2 Repair True 200 US 60.0
4 Account 2 Repair True 200 Canada 40.0
5 Account 2 Repair True 200 India 100.0
2 Account 3 Repair False 300 NaN NaN
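If you need the result to match the expected table exactly (with $ strings rather than plain numbers), a small optional follow-up could format the numeric columns back; this is just an illustrative sketch on top of the df produced above:

# optional: turn the numeric columns back into dollar strings
df['Cost'] = '$' + df['Cost'].astype(str)
df['Calculated_Cost'] = df['Calculated_Cost'].map(lambda v: f'${v:g}' if pd.notna(v) else v)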
I am sure there is a more efficient way to do this... but I got it done using the following code:
import pandas as pd

df1 = pd.DataFrame(
    {
        'Account Name': ['Account 1', 'Account 2', 'Account 3'],
        'Task Type': ['Repair', 'Repair', 'Repair'],
        'Flag': ['True', 'True', 'False'],
        'Cost': ['$100', '$200', '$300']
    }
)
df2 = pd.DataFrame(
    {
        'Country': ['US', 'Canada', 'India'],
        'Percentage': ['30%', '20%', '50%']
    }
)
df1['Cost'] = df1['Cost'].str.lstrip('$').astype(int)
df2['Percentage'] = df2['Percentage'].str.rstrip('%').astype(int).div(100)
filtered_df_true = df1.loc[df1['Flag'] == 'True']
filtered_df_false = df1.loc[df1['Flag'] == 'False']
df3 = filtered_df_true.assign(key=1).merge(df2.assign(key=1), how='outer', on='key')
df3['Calculated Cost'] = df3['Cost'] * df3['Percentage']
frames = [df3, filtered_df_false]
result = pd.concat(frames)
result.pop('key')
result.pop('Percentage')
print(result)

Drop groups whose variance is zero

Suppose the following df:
import numpy as np
import pandas as pd

d = {'month': ['01/01/2020', '01/02/2020', '01/03/2020', '01/01/2020', '01/02/2020', '01/03/2020'],
     'country': ['Japan', 'Japan', 'Japan', 'Poland', 'Poland', 'Poland'],
     'level': ['A01', 'A01', 'A01', 'A00', 'A00', 'A00'],
     'job title': ['Insights Manager', 'Insights Manager', 'Insights Manager', 'Sales Director', 'Sales Director', 'Sales Director'],
     'number': [0, 0.001, 0, 0, 0, np.nan],
     'age': [24, 22, 45, np.nan, 60, 32]}
df = pd.DataFrame(d)
The idea is to get the variance of a specific column by group (in this case by: country, level and job title), then select the segments whose variance is below a certain threshold and drop them from the original df.
However, when this is applied:
# define variance threshold
threshold = 0.0000000001
# get the variance by group for specific column
group_vars=df.groupby(['country', 'level', 'job title']).var()['number']
# select the rows to drop
rows_to_drop = df[group_vars<threshold].index
# drop the rows in place
#df.drop(rows_to_drop, axis=0, inplace=True)
The following error arises:
ValueError: Buffer dtype mismatch, expected 'Python object' but got 'long long'
The expected dataframe would drop Poland A00 Sales Director (variance 0.000000e+00) for all months, as it is a segment with zero variance.
Is it possible to reindex group_vars in order to drop it from the original df?
What am I missing?
You can achieve this with transform, which broadcasts each group's variance back onto the original row index (the grouped result in the attempt above is indexed by the group keys rather than by df's rows, which is why it cannot be used directly as a mask on df):
# define variance threshold
threshold = 0.0000000001
# get the variance by group for specific column
group_vars=df.groupby(['country', 'level', 'job title'])['number'].transform('var')
# select the rows to drop
rows_to_drop = df[group_vars<threshold].index
# drop the rows in place
df.drop(rows_to_drop, axis=0, inplace=True)
Which gives:
month country level job title number age
0 01/01/2020 Japan A01 Insights Manager 0.000 24.0
1 01/02/2020 Japan A01 Insights Manager 0.001 22.0
2 01/03/2020 Japan A01 Insights Manager 0.000 45.0
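An equivalent route, if you prefer not to build the row mask yourself, is GroupBy.filter, which keeps only the groups whose variance clears the threshold. A sketch under the same threshold assumption; note that groups with fewer than two non-null values have NaN variance and would also be dropped by this version:

# keep only the groups whose 'number' variance is at or above the threshold
df_kept = (df.groupby(['country', 'level', 'job title'])
             .filter(lambda g: g['number'].var() >= threshold))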

Tiering pandas column based on unique id and range cutoffs

I have one df that categorizes income into tiers across males and females and thousands of zip codes. I need to add a column to df2 that maps each person's income level by zip code (average, above average etc.).
The idea is to assign the highest cutoff exceeded by a given person's income, or assign the lowest tier by default.
The income level for each tier also varies by zip code. For certain zip codes there is a limited number of tiers (e.g. no very high incomes). There are also separate tiers for males by zip code, not shown due to space.
I think I need to create some sort of dictionary, but I'm not sure how to handle this. Any help would go a long way, thanks.
Edit: The first df acts as a key, and I am looking to use it to assign the corresponding row value from the 'Income level' column to df2.
E.g. for a unique id in df2, compare df2['Annual Income'] to the matching id in df['Annual Income cutoff']. Then assign the highest possible income level from df as a new row value in df2.
import pandas as pd
import numpy as np

data = [['female', 10009, 'very high', 10000000],
        ['female', 10009, 'high', 100000],
        ['female', 10009, 'above average', 75000],
        ['female', 10009, 'average', 50000]]
df = pd.DataFrame(data, columns=['Sex', 'Area Code', 'Income level', 'Annual Income cutoff'])
print(df)
Sex Area Code Income level Annual Income cutoff
0 female 10009 very high 10000000
1 female 10009 high 100000
2 female 10009 above average 75000
3 female 10009 average 50000
data_2 = [['female',10009, 98000], ['female', 10009, 56000]]
df2 = pd.DataFrame(data_2, columns = ['Sex', 'Area Code', 'Annual Income'])
print(df2)
Sex Area Code Annual Income
0 female 10009 98000
1 female 10009 56000
output_data = [['female',10009, 98000, 'above average'], ['female', 10009, 56000, 'average']]
final_output = pd.DataFrame(output_data, columns = ['Sex', 'Area Code', 'Annual Income', 'Income Level'])
print(final_output)
Sex Area Code Annual Income Income Level
0 female 10009 98000 above average
1 female 10009 56000 average
One way to do this is to use pd.merge_asof:
pd.merge_asof(df2.sort_values('Annual Income'),
              df.sort_values('Annual Income cutoff'),
              left_on='Annual Income',
              right_on='Annual Income cutoff',
              by=['Sex', 'Area Code'], direction='backward')
Output:
      Sex  Area Code  Annual Income   Income level  Annual Income cutoff
0  female      10009          56000        average                 50000
1  female      10009          98000  above average                 75000
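If you then want to reproduce the final_output frame from the question exactly, you could drop the helper cutoff column and rename the tier column. An illustrative follow-up, with the merge result stored in a variable (here called out) first:

out = pd.merge_asof(df2.sort_values('Annual Income'),
                    df.sort_values('Annual Income cutoff'),
                    left_on='Annual Income',
                    right_on='Annual Income cutoff',
                    by=['Sex', 'Area Code'], direction='backward')
# drop the helper column and match the expected column name
final_output = out.drop(columns='Annual Income cutoff').rename(columns={'Income level': 'Income Level'})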

How to check each time-series entry if name/id is in previous years entries?

I'm stuck.
I have a dataframe where rows are created at the time a customer quotes cost of a product.
My (truncated) data:
import pandas as pd
import pandas as pd

d = {'Quote Date': pd.to_datetime(['3/10/2016', '3/10/2016', '3/10/2016',
                                   '3/10/2016', '3/11/2017']),
     'Customer Name': ['Alice', 'Alice', 'Bob', 'Frank', 'Frank']
     }
df = pd.DataFrame(data=d)
I want to check, for each row, whether this is the first interaction I have had with this customer in over a year. My thought is to check each row's customer name against the customer names in the preceding year's worth of rows. If a row's customer name is not in the previous-year subset, then I will append a True value to the new column:
df['Is New']
In practice, the dataframe's shape will be close to (150000000, 5) and I fear adding a calculated column will not scale well.
I also thought to create a multi-index with the date and then customer name, but I was not sure how to execute the necessary search with this indexing.
Please apply any method you believe would be more efficient at checking for the first instance of a customer in the preceding year.
Here is the first approach that came to mind. I don't expect it to scale that well to 150M rows, but give it a try. Also, your truncated data does not produce a very interesting output, so I created some test data in which some users are new, and some are not:
# Create example data
d = {'Quote Date': pd.to_datetime(['3/10/2016',
                                   '3/10/2016',
                                   '6/25/2016',
                                   '1/1/2017',
                                   '6/25/2017',
                                   '9/29/2017']),
     'Customer Name': ['Alice', 'Bob', 'Alice', 'Frank', 'Bob', 'Frank']
     }
df = pd.DataFrame(d)
df.set_index('Quote Date', inplace=True)

# Solution
day = pd.DateOffset(days=1)
is_new = [s['Customer Name'] not in df.loc[i - 365*day:i - day]['Customer Name'].values
          for i, s in df.iterrows()]
df['Is New'] = is_new
df.reset_index(inplace=True)

# Result
df
Quote Date Customer Name Is New
0 2016-03-10 Alice True
1 2016-03-10 Bob True
2 2016-06-25 Alice False
3 2017-01-01 Frank True
4 2017-06-25 Bob True
5 2017-09-29 Frank False
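At 150M rows the row-by-row iterrows scan above will be slow. A vectorized sketch that should scale better is to compare each quote with the same customer's previous quote via a grouped shift. This assumes 'Quote Date' is a regular column (as after reset_index) and treats "new" as meaning the customer's most recent prior quote, if any, is more than 365 days old:

# sort by date so each customer's previous quote is the prior row within the group
df = df.sort_values('Quote Date')
prev_quote = df.groupby('Customer Name')['Quote Date'].shift()
gap = df['Quote Date'] - prev_quote
# new if the customer has no earlier quote, or the last one is over a year old
df['Is New'] = prev_quote.isna() | (gap > pd.Timedelta(days=365))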
