How to recode the columns based on condition - python

I have a large dataset to analyze, with many rows and columns.
I would like to make a new column ('Recode_Brand') that copies the 'Brand' column, but only keeps the top 10 brands and labels everything else 'Others'.
How can I write the condition and the logic?
It would be perfect if I could use a condition like the one below:
Brand_list = ['Google', 'Apple', 'Amazon', 'Microsoft', 'Tencent', 'Facebook', 'Visa', 'McDonald's', 'Alibaba', 'AT&T']
I am quite new to pandas and would appreciate your support. Thanks in advance.

Just use the 2018 column, for example:
df['Recode_Brand'] = df.apply(lambda row: row['Brand'] if row['2018'] <= 10 else 'Other', axis=1)
Or, if you need to use that brand list, you can do:
Brand_list = ["Google", "Apple", "Amazon", "Microsoft", "Tencent", "Facebook", "Visa", "McDonald's", "Alibaba", "AT&T"]
df['Recode_Brand'] = df.apply(lambda row: row['Brand'] if row['Brand'] in Brand_list else 'Other', axis=1)
NB: if a string contains a ' character, as in McDonald's, you have to either wrap the string in double quotes (") or escape the character as \'. The Brand_list in the question is a syntax error for exactly this reason.
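Since the question mentions a large dataset, note that a vectorized membership test is usually much faster than apply; a minimal sketch, assuming the same df and Brand_list as above:
import numpy as np
# isin is vectorized, so no Python-level lambda is called per row
df['Recode_Brand'] = np.where(df['Brand'].isin(Brand_list), df['Brand'], 'Other')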

Use numpy.where to find the brands in the top 10 and add a new column:
import numpy as np
import pandas as pd

df = pd.DataFrame({'2018': [7, 8, 3, 12, 15, 16, 10, 9, 4, 5, 11, 1, 14, 2, 13, 6],
                   'Brand': ['Google', 'Apple', 'Amazon', 'Microsoft', 'Tencent', 'Facebook',
                             'Visa', 'McDonalds', 'Alibaba', 'AT&T', 'IBM', 'Verizon',
                             'Marlboro', 'Coca-Cola', 'MasterCard', 'UPS']})
Create a new DataFrame with the top 10 brands:
top10 = df.nsmallest(10, '2018')
And add a new column, Recode_Brand, holding the brand if it is in the top 10, else 'Others':
# eq() compares by index alignment, so rows absent from top10 compare as False
df['Recode_Brand'] = np.where(df['Brand'].eq(top10['Brand']), df['Brand'], 'Others')
print(df)
2018 Brand Recode_Brand
0 7 Google Google
1 8 Apple Apple
2 3 Amazon Amazon
3 12 Microsoft Others
4 15 Tencent Others
5 16 Facebook Others
6 10 Visa Visa
7 9 McDonalds McDonalds
8 4 Alibaba Alibaba
9 5 AT&T AT&T
10 11 IBM Others
11 1 Verizon Verizon
12 14 Marlboro Others
13 2 Coca-Cola Coca-Cola
14 13 MasterCard Others
15 6 UPS UPS
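Note that df['Brand'].eq(top10['Brand']) only works because the comparison aligns on the shared index, so rows absent from top10 compare as False. If the indexes ever diverge (e.g. after a reset_index), a membership test is more robust; a sketch using the same frames:
df['Recode_Brand'] = np.where(df['Brand'].isin(top10['Brand']), df['Brand'], 'Others')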


Melt DataFrame with two value variables

I have a dataframe with inventory and purchases across multiple stores and regions. I am trying to stack the dataframe using melt, but I need to have two value columns, inventory and purchases, and can't figure out how to do that. The dataframe looks like this:
Region | Store | Inventory_Item_1 | Inventory_Item_2 | Purchase_Item_1 | Purchase_Item_2
-----------------------------------------------------------------------------------------
North  | A     | 15               | 20               | 5               | 6
North  | B     | 20               | 25               | 7               | 8
North  | C     | 18               | 22               | 6               | 10
South  | D     | 10               | 15               | 9               | 7
South  | E     | 12               | 12               | 10              | 8
The format I am trying to get the dataframe into looks like this:
Region | Store | Item             | Inventory | Purchases
----------------------------------------------------------
North  | A     | Inventory_Item_1 | 15        | 5
North  | A     | Inventory_Item_2 | 20        | 6
North  | B     | Inventory_Item_1 | 20        | 7
North  | B     | Inventory_Item_2 | 25        | 8
North  | C     | Inventory_Item_1 | 18        | 6
North  | C     | Inventory_Item_2 | 22        | 10
South  | D     | Inventory_Item_1 | 10        | 9
South  | D     | Inventory_Item_2 | 15        | 7
South  | E     | Inventory_Item_1 | 12        | 10
South  | E     | Inventory_Item_2 | 12        | 8
This is what I have written, but I don't know how to create columns for Inventory and Purchases. Note that my full dataframe is considerably larger (50+ regions, 140+ stores, 15+ items).
df_1 = df.melt(id_vars = ['Store','Region'],value_vars = ['Inventory_Item_1','Inventory_Item_2'])
Any help or advice would be appreciated!
I would do this with hierarchical indexes on the rows and columns.
For the rows, you can set_index(['Region', 'Store']) easily enough.
You have to get a little tricksy for the columns though. Since you need access to the non-index columns that result from setting the index on Region and Store, you need to pipe the frame to a custom function that builds the desired tuples and creates a named multi-level column index.
After that, you can stack the columns into the row index and optionally reset the full row index to make everything a normal column again.
import pandas as pd

df = pd.DataFrame({
    'Region': ['North', 'North', 'North', 'South', 'South'],
    'Store': ['A', 'B', 'C', 'D', 'E'],
    'Inventory_Item_1': [15, 20, 18, 10, 12],
    'Inventory_Item_2': [20, 25, 22, 15, 12],
    'Purchase_Item_1': [5, 7, 6, 9, 10],
    'Purchase_Item_2': [6, 8, 10, 7, 8]
})

output = (
    df.set_index(['Region', 'Store'])
      .pipe(lambda df:
            df.set_axis(df.columns.str.split('_', n=1, expand=True), axis='columns'))
      .rename_axis(['Status', 'Product'], axis='columns')
      .stack(level='Product')
      .reset_index()
)
Which gives me:
Region Store Product Inventory Purchase
North A Item_1 15 5
North A Item_2 20 6
North B Item_1 20 7
North B Item_2 25 8
North C Item_1 18 6
North C Item_2 22 10
South D Item_1 10 9
South D Item_2 15 7
South E Item_1 12 10
South E Item_2 12 8
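To match the column names requested in the question exactly, a final rename should suffice (a sketch):
output = output.rename(columns={'Product': 'Item', 'Purchase': 'Purchases'})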
You can use the pivot_longer function from pyjanitor; at the moment you have to install the latest development version from GitHub:
# install the latest dev version of pyjanitor
# pip install git+https://github.com/ericmjl/pyjanitor.git
import janitor
df.pivot_longer(
    index=["Region", "Store"],
    names_to=(".value", "item"),
    names_pattern=r"(Inventory|Purchase)_(.+)",
    sort_by_appearance=True,
)
Region Store item Inventory Purchase
0 North A Item_1 15 5
1 North A Item_2 20 6
2 North B Item_1 20 7
3 North B Item_2 25 8
4 North C Item_1 18 6
5 North C Item_2 22 10
6 South D Item_1 10 9
7 South D Item_2 15 7
8 South E Item_1 12 10
9 South E Item_2 12 8
It works by passing a regex containing groups to the names_pattern parameter. The '.value' entry in names_to ensures that Inventory and Purchase are kept as column headers, while the other group (Item_1 and Item_2) is collated into a new column, item.
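For comparison, plain pandas can do the same stub-based reshape with pd.wide_to_long, with no extra dependency; a minimal sketch, assuming the sample df from above:
import pandas as pd
# 'Inventory' and 'Purchase' are the stubs; everything after the first '_' is the suffix
out = pd.wide_to_long(
    df,
    stubnames=['Inventory', 'Purchase'],
    i=['Region', 'Store'],
    j='Item',
    sep='_',
    suffix=r'Item_\d+',  # the default suffix pattern is numeric, so spell it out
).reset_index()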
You can get to there by these steps:
# Please always provide minimal working code - otherwise we helpers and
# answerers have to invest extra time writing working starter code, which
# is unfair - we already spend enough time solving the problem:
import pandas as pd

df = pd.DataFrame([
    ["North", "A", 15, 20, 5, 6],
    ["North", "B", 20, 25, 7, 8],
    ["North", "C", 18, 22, 6, 10],
    ["South", "D", 10, 15, 9, 7],
    ["South", "E", 12, 12, 10, 8]],
    columns=["Region", "Store", "Inventory_Item_1", "Inventory_Item_2",
             "Purchase_Item_1", "Purchase_Item_2"])
# melt the dataframe completely first
df_final = pd.melt(df, id_vars=['Region', 'Store'], value_vars=['Inventory_Item_1', 'Inventory_Item_2', 'Purchase_Item_1', 'Purchase_Item_2'])
# extract inventory and purchase sub data frames
# they have in common the "variable" column (the item number!)
# so let it look exactly the same in both data frames by removing
# unnecessary parts
df_inventory = df_final.loc[df_final['variable'].str.startswith('Inventory')].copy()
df_inventory['variable'] = df_inventory['variable'].str.replace('Inventory_', '')
df_purchase = df_final.loc[df_final['variable'].str.startswith('Purchase')].copy()
df_purchase['variable'] = df_purchase['variable'].str.replace('Purchase_', '')
# deepcopy the data frames (just to keep old results so that you can inspect them)
df_purchase_ = df_purchase.copy()
df_inventory_ = df_inventory.copy()
# rename the columns to prepare for merging
df_inventory_.columns = ["Region", "Store", "variable", "Inventory"]
df_purchase_.columns = ["Region", "Store", "variable", "Purchase"]
# merge by the three common columns
df_final_1 = pd.merge(df_inventory_, df_purchase_, how="left", left_on=["Region", "Store", "variable"], right_on=["Region", "Store", "variable"])
# sort by the three common columns
df_final_1.sort_values(by=["Region", "Store", "variable"], axis=0)
This returns
Region Store variable Inventory Purchase
0 North A Item_1 15 5
5 North A Item_2 20 6
1 North B Item_1 20 7
6 North B Item_2 25 8
2 North C Item_1 18 6
7 North C Item_2 22 10
3 South D Item_1 10 9
8 South D Item_2 15 7
4 South E Item_1 12 10
9 South E Item_2 12 8
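The extract/rename/merge steps can also be collapsed by splitting the variable column once and pivoting; a sketch building on df_final from above (pivot_table's default mean aggregation is harmless here because each group holds a single value):
# split 'Inventory_Item_1' into kind='Inventory' and item='Item_1'
parts = df_final['variable'].str.split('_', n=1, expand=True)
df_final_1 = (
    df_final.assign(kind=parts[0], item=parts[1])
    .pivot_table(index=['Region', 'Store', 'item'], columns='kind', values='value')
    .reset_index()
)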

How to spread the data in pandas?

I'm working on an equivalent of R's spread in pandas. My dataframe looks like this:
Name age Language year Period
Nik 18 English 2018 Beginner
John 19 French 2019 Intermediate
Kane 33 Russian 2017 Advanced
xi 44 Thai 2015 Beginner
and looking for output like this
Name age Language Beginner Intermediate Advanced
Nik 18 English 2018
John 19 French 2019
Kane 33 Russian 2017
xi 44 Thai 2015
My code:
pd.pivot(x1,values='year', columns=['Period'])
I'm getting only the columns Beginner, Intermediate, and Advanced, not the entire dataframe.
While reshaping I tried using an index, but it says duplicates are not allowed in the index.
So I created a new index column, but I'm still not getting the entire dataframe.
If I understood correctly you could do something like this:
import numpy as np
import pandas as pd

# create dummy columns
res = pd.get_dummies(df['Period']).astype(np.int64)
# put each row's year where that row's dummy 1 sits
res.values[np.arange(len(res)), np.argmax(res.values, axis=1)] = df['year']
# concat and drop columns
output = pd.concat((df.drop(columns=['year', 'Period']), res), axis=1)
print(output)
Output
Name age Language Advanced Beginner Intermediate
0 Nik 18 English 0 2018 0
1 John 19 French 0 0 2019
2 Kane 33 Russian 2017 0 0
3 xi 44 Thai 0 2015 0
If you want to match the exact same output, convert the column to categorical first, and specify the order:
# encode as categorical
df['Period'] = pd.Categorical(df['Period'], ['Beginner', 'Advanced', 'Intermediate'], ordered=True)
# create dummy columns
res = pd.get_dummies(df['Period']).astype(np.int64)
res.values[np.arange(len(res)), np.argmax(res.values, axis=1)] = df['year']
# concat and drop columns
output = pd.concat((df.drop(columns=['year', 'Period']), res), axis=1)
print(output)
Output
Name age Language Beginner Advanced Intermediate
0 Nik 18 English 2018 0 0
1 John 19 French 0 0 2019
2 Kane 33 Russian 0 2017 0
3 xi 44 Thai 2015 0 0
Finally, if you want to replace the 0s with missing values, add a third step:
# create dummy columns
res = pd.get_dummies(df['Period']).astype(np.int64)
res.values[np.arange(len(res)), np.argmax(res.values, axis=1)] = df['year']
res = res.replace(0, np.nan)
Output (with missing values)
Name age Language Beginner Advanced Intermediate
0 Nik 18 English 2018.0 NaN NaN
1 John 19 French NaN NaN 2019.0
2 Kane 33 Russian NaN 2017.0 NaN
3 xi 44 Thai 2015.0 NaN NaN
One way to get the equivalent of R's spread function is pd.pivot_table:
If you don't mind about the index, you can use reset_index() on the newly created df:
new_df = (pd.pivot_table(df, index=['Name','age','Language'],columns='Period',values='year',aggfunc='sum')).reset_index()
which will get you:
Period Name age Language Advanced Beginner Intermediate
0 John 19 French NaN NaN 2019.0
1 Kane 33 Russian 2017.0 NaN NaN
2 Nik 18 English NaN 2018.0 NaN
3 xi 44 Thai NaN 2015.0 NaN
EDIT
If you have many columns in your dataframe and you want to include them in the reshaped dataset:
Put the two columns used inside the pivot table (i.e. Period and year) in a list
Gather all the other columns of your dataframe in a second list (using not in)
Use index_cols as the index in the pd.pivot_table() command
non_index_cols = ['Period','year'] # SPECIFY THE 2 COLUMNS IN THE PIVOT TABLE TO BE USED
index_cols = [i for i in df.columns if i not in non_index_cols] # GET ALL THE REST IN A LIST
new_df = (pd.pivot_table(df, index=index_cols,columns='Period',values='year',aggfunc='sum')).reset_index()
The new_df will include all the columns of your initial dataframe.
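As a side note, on pandas 1.1+ df.pivot itself accepts a list of index columns, so the aggfunc detour is unnecessary when each index combination is unique (a sketch):
new_df = (df.pivot(index=['Name', 'age', 'Language'], columns='Period', values='year')
            .reset_index())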

Querying a pandas DataFrame column against the main data source

Hi guys, I'm new to Python and I want to learn how to query a column of my dataframe against my main data source.
This is my pandas dataframe:
[In] top10_athletes = athletes.head(10)
top10_athletes = top10_athletes.rename(columns={'index': 'Name', 'Name': 'Medal Count'})
top10_athletes.index = np.arange(1,len(top10_athletes)+1)
top10_athletes
[Out]
Name Medal Count
1 Michael Fred Phelps, II 28
2 Larysa Semenivna Latynina (Diriy-) 18
3 Nikolay Yefimovich Andrianov 15
4 Ole Einar Bjørndalen 13
5 Borys Anfiyanovych Shakhlin 13
6 Edoardo Mangiarotti 13
7 Takashi Ono 13
8 Birgit Fischer-Schmidt 12
9 Paavo Johannes Nurmi 12
10 Sawao Kato 12
I want to query my main data source for all the values in the Name column.
The only way I could think of is this piece of code, which I found by searching Google:
df.query("Name == 'Michael Fred Phelps, II'")
Thanks guys!
To query by the Name column you can use the following:
df[df["Name"] == 'Michael Fred Phelps, II']

How to condense text cleaning steps into single Python function?

Newer programmer here, deeply appreciate any help this knowledgeable community is willing to provide.
I have a column of 140,000 text strings (company names) in a pandas dataframe. I want to strip all whitespace in and around the strings, remove all punctuation, substitute specific substrings, and uniformly transform to lowercase. I then want to keep the first 10 characters of each string and store them in a new dataframe column.
Here is a reproducible example.
import string
import pandas as pd

data = ["West Georgia Co",
        "W.B. Carell Clockmakers",
        "Spine & Orthopedic LLC",
        "LRHS Saint Jose's Grocery",
        "Optitech#NYCityScape"]
df = pd.DataFrame(data, columns=['co_name'])

def remove_punctuations(text):
    for punctuation in string.punctuation:
        text = text.replace(punctuation, '')
    return text
# applying remove_punctuations function
df['co_name_transform'] = df['co_name'].apply(remove_punctuations)
# this next step replaces 'Saint' with 'st' to standardize,
# and I may want to make other substitutions but this is a common one.
df['co_name_transform'] = df.co_name_transform.str.replace('Saint', 'st')
# replace whitespace
df['co_name_transform'] = df.co_name_transform.str.replace(' ', '')
# make lowercase
df['co_name_transform'] = df.co_name_transform.str.lower()
# select first 0:10 of strings
df['co_name_transform'] = df.co_name_transform.str[0:10]
print(df)
co_name co_name_transform
0 West Georgia Co westgeorgi
1 W.B. Carell Clockmakers wbcarellcl
2 Spine & Orthopedic LLC spineortho
3 LRHS Saint Jose's Grocery lrhsstjose
4 Optitech#NYCityScape optitechny
How can I put all these steps into a single function like this?
def clean_text(df[col]):
for co in co_name:
do_all_the_steps
return df[new_col]
Thank you
You don't need a function to do this. Try the following one-liner:
df['co_name_transform'] = df['co_name'].str.replace('[^A-Za-z0-9-]+', '', regex=True).str.replace('Saint', 'st').str.lower().str[0:10]
The final output will be:
co_name co_name_transform
0 West Georgia Co westgeorgi
1 W.B. Carell Clockmakers wbcarellcl
2 Spine & Orthopedic LLC spineortho
3 LRHS Saint Jose's Grocery lrhsstjose
4 Optitech#NYCityScape optitechny
You can do all the steps in the function you pass to the apply method:
import re
df['co_name_transform'] = df['co_name'].apply(lambda s: re.sub(r'[\W_]+', '', s).replace('Saint', 'st').lower()[:10])
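If you prefer the named-function shape sketched in the question, the same steps can be wrapped up as follows; a sketch where the helper name, its substitutions parameter, and keep are all illustrative:
import pandas as pd

def clean_text(df, col, new_col, substitutions=None, keep=10):
    # strip punctuation and whitespace, apply substitutions, lowercase, truncate
    s = df[col].str.replace(r'[\W_]+', '', regex=True)
    for old, new in (substitutions or {}).items():
        s = s.str.replace(old, new, regex=False)
    df[new_col] = s.str.lower().str[:keep]
    return df

df = clean_text(df, 'co_name', 'co_name_transform', {'Saint': 'st'})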
Another solution, similar to the previous one, but with the replacements collected in one dictionary so you can add more items to replace. Note that the previous solution truncates each string to its first 10 characters; the version below instead shows the first 10 rows with full cleaned strings.
data = ["West Georgia Co",
"W.B. Carell Clockmakers",
"Spine & Orthopedic LLC",
"LRHS Saint Jose's Grocery",
"Optitech#NYCityScape","Optitech#NYCityScape","Optitech#NYCityScape","Optitech#NYCityScape","Optitech#NYCityScape","Optitech#NYCityScape","Optitech#NYCityScape","Optitech#NYCityScape","Optitech#NYCityScape"]
df = pd.DataFrame(data, columns = ['co_name'])
to_replace = {'[^A-Za-z0-9-]+':'','Saint':'st'}
for i in to_replace :
df['co_name'] = df['co_name'].str.replace(i,to_replace[i]).str.lower()
df['co_name'][0:10]
Result :
0 westgeorgiaco
1 wbcarellclockmakers
2 spineorthopedicllc
3 lrhsstjosesgrocery
4 optitechnycityscape
5 optitechnycityscape
6 optitechnycityscape
7 optitechnycityscape
8 optitechnycityscape
9 optitechnycityscape
Name: co_name, dtype: object
Previous solution (truncates each string to its first 10 characters):
df['co_name_transform'] = df['co_name'].str.replace('[^A-Za-z0-9-]+', '', regex=True).str.replace('Saint', 'st').str.lower().str[0:10]
Result :
0 westgeorgi
1 wbcarellcl
2 spineortho
3 lrhsstjose
4 optitechny
5 optitechny
6 optitechny
7 optitechny
8 optitechny
9 optitechny
10 optitechny
11 optitechny
12 optitechny
Name: co_name_transform, dtype: object

Rename the less frequent categories by "OTHER" python

In my dataframe I have some categorical columns with over 100 different categories. I want to rank the categories by frequency, keep the 9 most frequent, and automatically rename the less frequent ones to OTHER.
Example:
Here my df :
print(df)
Employee_number Jobrol
0 1 Sales Executive
1 2 Research Scientist
2 3 Laboratory Technician
3 4 Sales Executive
4 5 Research Scientist
5 6 Laboratory Technician
6 7 Sales Executive
7 8 Research Scientist
8 9 Laboratory Technician
9 10 Sales Executive
10 11 Research Scientist
11 12 Laboratory Technician
12 13 Sales Executive
13 14 Research Scientist
14 15 Laboratory Technician
15 16 Sales Executive
16 17 Research Scientist
17 18 Research Scientist
18 19 Manager
19 20 Human Resources
20 21 Sales Executive
valCount = df['Jobrol'].value_counts()
valCount
Sales Executive 7
Research Scientist 7
Laboratory Technician 5
Manager 1
Human Resources 1
For this example I keep the first 3 categories and rename the rest to "OTHER". How should I proceed?
Thanks.
Convert your series to categorical, extract categories whose counts are not in the top 3, add a new category e.g. 'Other', then replace the previously calculated categories:
df['Jobrol'] = df['Jobrol'].astype('category')
others = df['Jobrol'].value_counts().index[3:]
label = 'Other'
df['Jobrol'] = df['Jobrol'].cat.add_categories([label])
df['Jobrol'] = df['Jobrol'].replace(others, label)
Note: It's tempting to combine categories by renaming them via df['Jobrol'].cat.rename_categories(dict.fromkeys(others, label)), but this won't work as this will imply multiple identically labeled categories, which isn't possible.
The above solution can be adapted to filter by count. For example, to include only categories with a count of 1 you can define others as so:
counts = df['Jobrol'].value_counts()
others = counts[counts == 1].index
Use value_counts with numpy.where:
need = df['Jobrol'].value_counts().index[:3]
df['Jobrol'] = np.where(df['Jobrol'].isin(need), df['Jobrol'], 'OTHER')
valCount = df['Jobrol'].value_counts()
print (valCount)
Research Scientist 7
Sales Executive 7
Laboratory Technician 5
OTHER 2
Name: Jobrol, dtype: int64
Another solution:
N = 3
s = df['Jobrol'].value_counts()
valCount = pd.concat([s.iloc[:N], pd.Series(s.iloc[N:].sum(), index=['OTHER'])])
print (valCount)
Research Scientist 7
Sales Executive 7
Laboratory Technician 5
OTHER 2
dtype: int64
One-line solution (this keeps every category whose count exceeds a threshold, rather than a fixed top N):
limit = 4  # with the sample counts above, this keeps the three categories whose count exceeds 4
df['Jobrol'] = df['Jobrol'].map({cat: cat if cnt > limit else 'OTHER' for cat, cnt in df['Jobrol'].value_counts().items()})
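The same threshold idea can be written without building the dictionary, by mapping each value to its count; a sketch, assuming Jobrol is a plain (non-categorical) column:
counts = df['Jobrol'].value_counts()
# keep a value where its count exceeds the limit, otherwise replace it with 'OTHER'
df['Jobrol'] = df['Jobrol'].where(df['Jobrol'].map(counts) > limit, 'OTHER')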
