Remove special characters and related text from a DataFrame - Python

My DataFrame is cluttered with special characters and some company suffixes that I'm trying to get rid of.
df
--
Microsoft inc
google INC
Apple Pvt Ltd
orc~l PvT ltd
Am##zon Pvt Ltd
Expected output:
df
--
Microsoft
google
Apple
oracl
Amazon
What I tried:
word_list= ['inc','INC','Pvt Ltd', 'PvT ltd']
df1= ''.join([repl if idx in word_list else idx for idx in df])
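One way to approach this (a sketch, not from the original post; it assumes the column is called company and builds the frame inline) is to strip the suffixes with a case-insensitive regex and then drop the remaining special characters. Note that this only removes the stray characters, so orc~l becomes orcl rather than oracl; restoring missing letters would need a separate lookup.
import re
import pandas as pd

df = pd.DataFrame({'company': ['Microsoft inc', 'google INC', 'Apple Pvt Ltd',
                               'orc~l PvT ltd', 'Am##zon Pvt Ltd']})

# Drop the company suffixes (case-insensitive), then remove any character
# that is not a letter, digit or space, and trim leftover whitespace.
df['company'] = (df['company']
                 .str.replace(r'\b(?:inc|pvt\s+ltd)\b', '', regex=True, flags=re.I)
                 .str.replace(r'[^A-Za-z0-9 ]', '', regex=True)
                 .str.strip())
print(df)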

Related

Check if there is any 12-character word available in the string. If available, extract the word

I have been looking to extract only a 12-character word from the string if it exists.
I need to check that the first 5 characters are from a given list and that the last 3 characters are numbers.
Input data (Data.xlsx):
Description Number
CHQ -AQBCN222Q546 from India Federation Pvt Ltd
CHQN#DJBNK220Q329 from Indiana Basics Software Ltd -BC003
CASH- NJRQC225J987^ from US Fertilizers LLP
CHQ - from India Bulls Pvt Ltd
AQBCN222Q989 from India Bulls Pvt Ltd
CHQ -AQCCN222Q546 from India Federation Pvt Ltd
CASH - AQBCN222Q546289 from India Federation Pvt Ltd
list_Character = ['AQBCN','PUCNQ','DJBNK','ADJBC','NJRQC']
Expected output:
Description Number
CHQ -AQBCN222Q546 from India Federation Pvt Ltd AQBCN222Q546
CHQN#DJBNK220Q329 from Indiana Basics Software Ltd -BC003 DJBNK220Q329
CASH- NJRQC225J987^ from US Fertilizers LLP NJRQC225J987
CHQ - from India Bulls Pvt Ltd
AQBCN222Q989 from India Bulls Pvt Ltd AQBCN222Q989
CHQ -AQCCN222Q546 from India Federation Pvt Ltd
CASH - AQBCN222Q546289 from India Federation Pvt Ltd
Code:
import pandas as pd
import re
df = pd.read_excel(r'D:/Users/Data.xlsx')
list_Character = ['AQBCN','PUCNQ','DJBNK','ADJBC','NJRQC']
regex = r'[#-]((?:' + r'|'.join(list_Character) + r')\w{5})\b'
df["Number"] = df["Description"].str.extract(regex)
I am not able to find a solution. I have tried following the reference from "Check if there is any 10 character word available in the string If Exist Extract the word", but it did not work.
You can slightly modify the regex to remove the leading character match and match 7 extra characters:
list_Character = ['AQBCN','PUCNQ','DJBNK','ADJBC','NJRQC']
regex = r'((?:' + r'|'.join(list_Character) + r')\w{7})\b'
df["Number"] = df["Description"].str.extract(regex)
Output:
Description Number
0 CHQ -AQBCN222Q546 from India Federation Pvt Ltd AQBCN222Q546
1 CHQN#DJBNK220Q329 from Indiana Basics Software... DJBNK220Q329
2 CASH- NJRQC225J987^ from US Fertilizers LLP NJRQC225J987
3 CHQ - from India Bulls Pvt Ltd NaN
4 AQBCN222Q989 from India Bulls Pvt Ltd AQBCN222Q989
5 CHQ -AQCCN222Q546 from India Federation Pvt Ltd NaN
6 CASH - AQBCN222Q546289 from India Federation P... NaN
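For reference, a self-contained version of the same approach (a sketch with the sample rows built inline instead of being read from Data.xlsx). The trailing \b is what stops the 15-character value AQBCN222Q546289 in the last row from matching.
import pandas as pd

df = pd.DataFrame({'Description': [
    'CHQ -AQBCN222Q546 from India Federation Pvt Ltd',
    'CHQN#DJBNK220Q329 from Indiana Basics Software Ltd -BC003',
    'CASH- NJRQC225J987^ from US Fertilizers LLP',
    'CHQ - from India Bulls Pvt Ltd',
    'AQBCN222Q989 from India Bulls Pvt Ltd',
    'CHQ -AQCCN222Q546 from India Federation Pvt Ltd',
    'CASH - AQBCN222Q546289 from India Federation Pvt Ltd']})

list_Character = ['AQBCN', 'PUCNQ', 'DJBNK', 'ADJBC', 'NJRQC']

# A listed 5-character prefix followed by exactly 7 more word characters,
# ending on a word boundary so longer runs are rejected.
regex = r'((?:' + '|'.join(list_Character) + r')\w{7})\b'
df['Number'] = df['Description'].str.extract(regex)
print(df)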

How can I remove an incomplete parenthesis without affecting other rows using pandas?

I have a DataFrame where I want to remove an unclosed parenthesis and the text that follows it, without affecting other rows.
Assume the following dataframe "cusomter" with the column "Name"
Name
Company ABC (Malaysia) Ltd
Company HIJ (M) Ltd (B12
Company KLM (M) Ltd (
My code:
customer["Name"] = customer["Name"].str.replace(r"\([^()]{1,3}", "", regex=True)
Output:
Name
Company ABC aysia) Ltd
Company HIJ (M) Ltd
Company KLM (M) Ltd
My desired output:
Name
Company ABC (Malaysia) Ltd
Company HIJ (M) Ltd
Company KLM (M) Ltd
May I know if it is possible to do this? I have tried different approaches and I'm still not able to get my desired output.
Assuming there won't be nested parentheses in the garbage section, you can anchor the replacement to the end of the string.
import pandas as pd
df = pd.DataFrame({"company": ["Company ABC (Malaysia) Ltd", "Company HIJ (M) Ltd (B12", "Company KLM (M) Ltd ("]})
df["company"] = df["company"].str.replace(r"\s*\([^()]*?$", "", regex=True)
print(df)
company
0 Company ABC (Malaysia) Ltd
1 Company HIJ (M) Ltd
2 Company KLM (M) Ltd
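Applied to the customer frame and the Name column from the question, the same replacement would look like this (a sketch, assuming the three sample rows above):
import pandas as pd

customer = pd.DataFrame({"Name": ["Company ABC (Malaysia) Ltd",
                                  "Company HIJ (M) Ltd (B12",
                                  "Company KLM (M) Ltd ("]})

# Remove only an opening parenthesis (and whatever follows it) that is never
# closed before the end of the string, so complete "(...)" groups are kept.
customer["Name"] = customer["Name"].str.replace(r"\s*\([^()]*?$", "", regex=True)
print(customer)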

How to execute a script only under a condition in Python

I am looking to execute the script only when a condition is satisfied: if Column1 (id_number) is not blank, run the script below; otherwise print a message. I have tried several ways but couldn't find a way that works.
Sheet1
id_number company_name match_acc
IN2231D AXN pvt Ltd
UK654IN Aviva Intl Ltd
Ship Incorporations
LK0678G Oppo Mobiles pvt ltd
NG5678J Nokia Inc
Sheet2
identity_no Pincode company_name
IN2231 110030 AXN pvt Ltd
UK654IN 897653 Aviva Intl Ltd
SL1432 07658 Ship Incorporations
LK0678G 120988 Oppo Mobiles Pvt Ltd
Script I have been using:
df1 = pd.read_excel(open(r'input.xlsx', 'rb'), sheet_name='sheet1')
df2 = pd.read_excel(open(r'input.xlsx', 'rb'), sheet_name='sheet2')
if df1[['id_number']] is not NaN:
    cross = df1[['id_number']].merge(df2[['identity_no']], how='cross')
    cross['match_acc'] = cross.apply(lambda x: fuzz.ratio(x.id_number, x.identity_no), axis=1)
    df1['match_acc'] = df1.id_number.map(cross.groupby('id_number').match_acc.max())
How about:
filtered = df1[df1['id_number'] != ""]
Then use the filtered variable for the rest of your code.
The id_number may just be an empty string and not necessarily NaN. I usually resort to this when checking for an empty column:
df[(df[column_name].notnull()) & (df[column_name] != '')]
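Putting the two suggestions together with the original merge, a sketch of how it might look; it assumes fuzz comes from thefuzz (or fuzzywuzzy), as in the question, and that input.xlsx exists with the two sheets shown above.
import pandas as pd
from thefuzz import fuzz  # or: from fuzzywuzzy import fuzz

df1 = pd.read_excel(r'input.xlsx', sheet_name='sheet1')
df2 = pd.read_excel(r'input.xlsx', sheet_name='sheet2')

# Keep only rows whose id_number is neither NaN nor an empty string.
valid = df1[df1['id_number'].notnull() & (df1['id_number'] != '')]

if valid.empty:
    print('No id_number values to match.')
else:
    cross = valid[['id_number']].merge(df2[['identity_no']], how='cross')
    cross['match_acc'] = cross.apply(lambda x: fuzz.ratio(x.id_number, x.identity_no), axis=1)
    df1['match_acc'] = df1.id_number.map(cross.groupby('id_number').match_acc.max())
print(df1)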

Merging two columns using a cross merge in Python

I am looking to merge two columns using a cross merge, which I need later for further analysis.
Input data:
id_number company_name match_acc
IN2231D AXN pvt Ltd
UK654IN Aviva Intl Ltd
SL1432H Ship Incorporations
LK0678G Oppo Mobiles pvt ltd
NG5678J Nokia Inc
identity_no Pincode company_name
IN2231 110030 AXN pvt Ltd
UK654IN 897653 Aviva Intl Ltd
SL1432 07658 Ship Incorporations
LK0678G 120988 Oppo Mobiles Pvt Ltd
I am looking to cross-merge the column id_number with identity_no.
Code I am using so far:
cross = df1[['id_number']].merge(df2[['identity_no']], how='cross')
But I am getting the error:
pandas.errors.MergeError: No common columns to perform merge on. Merge options: left_on=None, right_on=None, left_index=False, right_index=False
Output I need:
# id_number identity_no
# 0 IN2231D IN2231
# 1 IN2231D UK654IN
# 2 IN2231D SL1432
# ...
# 17 NG5678J UK654IN
# 18 NG5678J SL1432
# 19 NG5678J LK0678G
Please suggest.
how='cross' was introduced in pandas 1.2.0 (pd.__version__ == '1.2.0'), so if you have an older version of pandas it will not work. If for some reason you cannot upgrade, you can accomplish the same thing by merging on a helper column that holds the same constant in both DataFrames and dropping it afterwards.
import pandas as pd
df1 = pd.DataFrame({'x': [1,2]})
df2 = pd.DataFrame({'y': ['a', 'b']})
# For versions >=1.2.0
df1.merge(df2, how='cross')
# x y
#0 1 a
#1 1 b
#2 2 a
#3 2 b
# For older versions assign a constant you merge on.
df1.assign(t=1).merge(df2.assign(t=1), on='t').drop(columns='t')
# x y
#0 1 a
#1 1 b
#2 2 a
#3 2 b
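With the id_number and identity_no columns from the question, the same idea produces the 20 expected pairs (a sketch with the frames built inline):
import pandas as pd

df1 = pd.DataFrame({'id_number': ['IN2231D', 'UK654IN', 'SL1432H', 'LK0678G', 'NG5678J']})
df2 = pd.DataFrame({'identity_no': ['IN2231', 'UK654IN', 'SL1432', 'LK0678G']})

# pandas >= 1.2.0
cross = df1[['id_number']].merge(df2[['identity_no']], how='cross')

# Older pandas: merge on a constant helper column instead.
# cross = (df1[['id_number']].assign(t=1)
#          .merge(df2[['identity_no']].assign(t=1), on='t')
#          .drop(columns='t'))

print(cross)  # 5 x 4 = 20 rows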

Replace only last occurrence of column value in DataFrame

I've a DataFrame with a Company column.
Company
-------------------------------
Tundra Corporation Art Limited
Desert Networks Incorporated
Mount Yellowhive Security Corp
Carter, Rath and Mueller Limited (USD/AC)
Barrows corporation /PACIFIC
Corporation, Mounted Security
I've a dictionary with regexes to normalize the company entities.
(^|\s)corporation(\s|$); Corp
(^|\s)Limited(\s|$); LTD
(^|\s)Incorporated(\s|$); INC
...
I need to normalize only the last occurrence. This is my desired output.
Company
-------------------------------
Tundra Corporation Art LTD
Desert Networks INC
Mount Yellowhive Security Corp
Carter, Rath and Mueller LTD (USD/AC)
Barrows Corp /PACIFIC
Corp, Mounted Security
(Only normalize Limited, not Corporation, for: Tundra Corporation Art Limited.)
My code:
for k, v in entity_dict.items():
    df['Company'].replace(regex=True, inplace=True, to_replace=re.compile(k, re.I), value=v)
Is it possible to only change the last occurrence of an entity (do I need to change my regex)?
Change (\s|$) to ($) to match only at the end of the string:
entity_dict = {'(^|\s)corporation($)': ' Corp',
               '(^|\s)Limited($)': ' LTD',
               '(^|\s)Incorporated($)': ' INC'}
for k, v in entity_dict.items():
    df['Company'].replace(regex=True, inplace=True, to_replace=re.compile(k, re.I), value=v)
print (df)
Company
0 Tundra Corporation Art LTD
1 Desert Networks INC
2 Mount Yellowhive Security Corp
EDIT: You can simplify the dictionary to plain words (no regex), then create a lowercase dict, use Series.str.findall to get the last matched word with str[-1], map it through the lowercase dict with Series.map, and finally replace it in a list comprehension:
entity_dict = {'corporation': 'Corp',
               'Limited': 'LTD',
               'Incorporated': 'INC'}
lower = {k.lower(): v for k, v in entity_dict.items()}
s1 = df['Company'].str.findall('|'.join(lower.keys()), flags=re.I).str[-1].fillna('')
s2 = s1.str.lower().map(lower).fillna('')
df['Company'] = [a.replace(b, c) for a, b, c in zip(df['Company'], s1, s2)]
print (df)
Company
0 Tundra Corporation Art LTD
1 Desert Networks INC
2 Mount Yellowhive Security Corp
3 Carter, Rath and Mueller LTD (USD/AC)
4 Barrows Corp /PACIFIC
5 Corp, Mounted Security
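For completeness, a self-contained sketch of the EDIT approach (imports plus the Company column from the question built inline); it finds the last entity word per row and substitutes its normalized form.
import re
import pandas as pd

df = pd.DataFrame({'Company': ['Tundra Corporation Art Limited',
                               'Desert Networks Incorporated',
                               'Mount Yellowhive Security Corp',
                               'Carter, Rath and Mueller Limited (USD/AC)',
                               'Barrows corporation /PACIFIC',
                               'Corporation, Mounted Security']})

entity_dict = {'corporation': 'Corp', 'Limited': 'LTD', 'Incorporated': 'INC'}
lower = {k.lower(): v for k, v in entity_dict.items()}

# Last entity word found in each row (case-insensitive) and its normalized form.
s1 = df['Company'].str.findall('|'.join(lower.keys()), flags=re.I).str[-1].fillna('')
s2 = s1.str.lower().map(lower).fillna('')

# Substitute that word in the original string.
df['Company'] = [a.replace(b, c) for a, b, c in zip(df['Company'], s1, s2)]
print(df)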
