Replace only last occurrence of column value in DataFrame - python

I've a DataFrame with a Company column.
Company
-------------------------------
Tundra Corporation Art Limited
Desert Networks Incorporated
Mount Yellowhive Security Corp
Carter, Rath and Mueller Limited (USD/AC)
Barrows corporation /PACIFIC
Corporation, Mounted Security
I've a dictionary with regexes to normalize the company entities.
(^|\s)corporation(\s|$); Corp
(^|\s)Limited(\s|$); LTD
(^|\s)Incorporated(\s|$); INC
...
I need to normalize only the last occurrence. This is my desired output.
Company
-------------------------------
Tundra Corporation Art LTD
Desert Networks INC
Mount Yellowhive Security Corp
Carter, Rath and Mueller LTD (USD/AC)
Barrows Corp /PACIFIC
Corp, Mounted Security
(Only Limited should be normalized, not Corporation, in: Tundra Corporation Art Limited)
My code:
for k, v in entity_dict.items():
    df['Company'].replace(regex=True, inplace=True, to_replace=re.compile(k,re.I), value=v)
Is it possible to change only the last occurrence of an entity (do I need to change my regex)?

Change (\s|$) to ($) so that each pattern matches only at the end of the string:
import re
entity_dict = {r'(^|\s)corporation($)': ' Corp',
               r'(^|\s)Limited($)': ' LTD',
               r'(^|\s)Incorporated($)': ' INC'}
for k, v in entity_dict.items():
    # assign the result back instead of using inplace=True on a column selection
    df['Company'] = df['Company'].replace(to_replace=re.compile(k, re.I), value=v, regex=True)
print (df)
Company
0 Tundra Corporation Art LTD
1 Desert Networks INC
2 Mount Yellowhive Security Corp
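EDIT: To handle entities that are not at the end of the string, simplify the dictionary to plain words (no regex) and build a lowercase copy of it. Then use Series.str.findall to collect every match, take the last one by indexing with str[-1], translate it with Series.map and the lowercase dictionary, and finally do the replacement in a list comprehension:
import re
entity_dict = {'corporation': 'Corp',
               'Limited': 'LTD',
               'Incorporated': 'INC'}
lower = {k.lower():v for k, v in entity_dict.items()}
s1 = df['Company'].str.findall('|'.join(lower.keys()), flags=re.I).str[-1].fillna('')
s2 = s1.str.lower().map(lower).fillna('')
# rsplit(b, 1) touches only the last occurrence of b; rows without a match (b == '') are kept as they are
df['Company'] = [c.join(a.rsplit(b, 1)) if b else a for a, b, c in zip(df['Company'], s1, s2)]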
print (df)
Company
0 Tundra Corporation Art LTD
1 Desert Networks INC
2 Mount Yellowhive Security Corp
3 Carter, Rath and Mueller LTD (USD/AC)
4 Barrows Corp /PACIFIC
5 Corp, Mounted Security
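As an alternative, here is a minimal sketch that does the same job with a single regular expression. It assumes that entities should only match as whole words (hence the \b anchors, which are not part of the original patterns), and the helper name normalize_last is made up for illustration. The greedy (.*) prefix is what makes the captured entity the last one in the string.

```python
import re
import pandas as pd

df = pd.DataFrame({"Company": [
    "Tundra Corporation Art Limited",
    "Desert Networks Incorporated",
    "Mount Yellowhive Security Corp",
    "Carter, Rath and Mueller Limited (USD/AC)",
    "Barrows corporation /PACIFIC",
    "Corporation, Mounted Security",
]})

# Lowercase keys so the lookup can be case-insensitive.
entity_dict = {"corporation": "Corp", "limited": "LTD", "incorporated": "INC"}

# The greedy (.*) prefix consumes as much as possible, so the entity captured
# in group 2 is the last one in the string; \b restricts it to whole words
# (an assumption of this sketch).
words = "|".join(re.escape(k) for k in entity_dict)
last_entity = re.compile(rf"(.*)\b({words})\b", flags=re.I | re.S)

def normalize_last(name):
    # Hypothetical helper: keep the prefix untouched, swap only the captured entity.
    return last_entity.sub(
        lambda m: m.group(1) + entity_dict[m.group(2).lower()], name, count=1
    )

df["Company"] = df["Company"].map(normalize_last)
print(df)
```

Because the pattern is anchored by the greedy prefix, a name that contains the same entity twice would still have only its final occurrence rewritten.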

Related

How can I remove the incomplete parenthesis without affect other row using pandas?

I have a data frame where I want to remove an unclosed parenthesis and the text after it without affecting the other rows.
Assume the following dataframe "customer" with the column "Name"
Name
Company ABC (Malaysia) Ltd
Company HIJ (M) Ltd (B12
Company KLM (M) Ltd (
My code:
customer["Name"] = customer["Name"].str.replace(r"\([^()]{1,3}", "", regex=True)
Output:
Name
Company ABC aysia) Ltd
Company HIJ (M) Ltd
Company KLM (M) Ltd
My desired output:
Name
Company ABC (Malaysia) Ltd
Company HIJ (M) Ltd
Company KLM (M) Ltd
Is it possible to do this? I have tried different approaches and I'm still not able to get my desired output.
Assuming there won't be nested parentheses in the garbage section, you can anchor the replacement to the end of the string.
import pandas as pd
df = pd.DataFrame({"company": ["Company ABC (Malaysia) Ltd", "Company HIJ (M) Ltd (B12", "Company KLM (M) Ltd ("]})
df["company"] = df["company"].str.replace(r"\s*\([^()]*?$", "", regex=True)
print(df)
company
0 Company ABC (Malaysia) Ltd
1 Company HIJ (M) Ltd
2 Company KLM (M) Ltd

Split dataframe column based on specific list of words

Is it possible to split strings from a dataframe column based on a list of words?
For example: There is a dataframe with a column Company, each record includes the company name, a legal form, and sometimes additional information after the legal form like 'electronics'.
Company
XYZ ltd electronics
ABC ABC inc iron
AB XY Z inc
CD EF GHI JK llc incident
I have a list of 1500 worldwide legal forms for companies (inc, ltd, ...). I would like to split the strings in the dataframe column based on this legal form list, for example:
['gmbh', 'ltd', 'inc', 'srl', 'spa', 'co', 'sa', 'ag', 'kg', 'ab', 'spol', 'sasu', 'sas', 'pvt', 'sarl', 'gmbh & co kg', 'llc', 'ilc', 'corp', 'ltda', 'coltd', 'se', 'as', 'sp zoo', 'plc', 'pvtltd', 'og', 'gen']
In other words, to separate everything before and after the words in the list to new columns. This is the desired output:
Company        Legal form   Addition
XYZ            ltd          electronics
ABC ABC        inc          iron
AB XY Z        inc
CD EF GHI JK   llc          incident
Note that "inc" appears in the middle, at the end, and also as part of a word in the various company name examples.
You could use a regular expression (regex) to pick out the legal form. Each legal form is written in this format: \slegalform\s
\s matches a whitespace character, so each legal form must be preceded and followed by whitespace. A trailing space is appended to every company name so that a legal form at the end of the string also matches. The data is processed in lowercase and then converted back to Title Case. So try this:
import pandas as pd
import re
# raw string, so that \s reaches the regex engine unchanged
legal_forms = r'(\sgmbh\s|\sltd\s|\sinc\s|\ssrl\s|\sspa\s|\sco\s|\ssa\s|\sag\s|\skg\s|\sab\s|\sspol\s|\ssasu\s|\ssas\s|\spvt\s|\ssarl\s|\sgmbh\s&\sco\skg\s|\sllc\s|\silc\s|\scorp\s|\sltda\s|\scoltd\s|\sse\s|\sas\s|\ssp\szoo\s|\splc\s|\spvtltd\s|\sog\s|\sgen\s)'
df = pd.DataFrame({'Company': ['XYZ ltd electronics', 'ABC ABC inc iron', 'AB XY Z inc', 'CD EF GHI JK llc incident']}, columns=['Company'])
df['Coy']= df['Company'].apply(lambda x: [e.strip() for e in re.split(legal_forms, x.lower()+' ')])
print(df)
This creates a list for each company name, split at the legal form:
Company Coy
0 XYZ ltd electronics [xyz, ltd, electronics]
1 ABC ABC inc iron [abc abc, inc, iron]
2 AB XY Z inc [ab xy z, inc, ]
3 CD EF GHI JK llc incident [cd ef ghi jk, llc, incident]
After that you can split them into 3 separate columns:
df1 = pd.DataFrame(df['Coy'].tolist(), columns=['Company', 'Legal form', 'Addition'])
for col in df1.columns:
    df1[col] = df1[col].str.title()
print(df1)
Output:
Company Legal form Addition
0 Xyz Ltd Electronics
1 Abc Abc Inc Iron
2 Ab Xy Z Inc
3 Cd Ef Ghi Jk Llc Incident
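With 1500 legal forms, writing the alternation by hand is error-prone. The following is a rough sketch, not the answer's original method, that builds the pattern from the list and splits with Series.str.extract. It assumes each name contains at most one legal form as a whole word, and the group name Legal_form is a stand-in because regex group names cannot contain spaces.

```python
import re
import pandas as pd

legal_forms = ['gmbh', 'ltd', 'inc', 'srl', 'spa', 'co', 'sa', 'ag', 'kg', 'ab',
               'spol', 'sasu', 'sas', 'pvt', 'sarl', 'gmbh & co kg', 'llc', 'ilc',
               'corp', 'ltda', 'coltd', 'se', 'as', 'sp zoo', 'plc', 'pvtltd',
               'og', 'gen']

# Longest forms first, so 'gmbh & co kg' is tried before 'gmbh'.
alternation = "|".join(
    re.escape(form) for form in sorted(legal_forms, key=len, reverse=True)
)

# Company (lazy) + whitespace + legal form + optional "whitespace + addition".
# The legal form must be followed by whitespace or the end of the string,
# so 'inc' does not match inside 'incident'.
pattern = (
    rf"^(?P<Company>.*?)\s+(?P<Legal_form>{alternation})"
    r"(?:\s+(?P<Addition>.*))?$"
)

df = pd.DataFrame({"Company": ["XYZ ltd electronics", "ABC ABC inc iron",
                               "AB XY Z inc", "CD EF GHI JK llc incident"]})

parts = df["Company"].str.extract(pattern, flags=re.I)
parts = parts.rename(columns={"Legal_form": "Legal form"})
print(parts)
```

Rows without any legal form would come back as all-NaN, which makes them easy to find and handle separately.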
Assuming you only need to split on a few known legal forms, you could try something like this (note that the pattern has no word boundaries, so it would also split inside a word such as "incident"):
import re
import pandas as pd
df = pd.DataFrame({'Company': ['XYZ ltd electronics', 'ABC ABC inc iron', 'AB XY Z inc', 'CD EF GHI JK llc chicago']}, columns=['Company'])
df['Addition'] = df['Company'].apply(lambda x: re.split('(ltd|inc|llc)', x))
print(df)
Company Addition
0 XYZ ltd electronics [XYZ , ltd, electronics]
1 ABC ABC inc iron [ABC ABC , inc, iron]
2 AB XY Z inc [AB XY Z , inc, ]
3 CD EF GHI JK llc chicago [CD EF GHI JK , llc, chicago]

Change the value of a dataframe column to the value of a second column conditional on the value of a third column in pandas

I have data with current names of companies, old names, and the date of name changes. It looks like this:
name                              former_name1                  name_change_date1
ACMAT CORP                        nan                           NaT
ACME ELECTRIC CORP                nan                           NaT
ACME UNITED CORP                  nan                           NaT
COLUMBIA ACORN TRUST              LIBERTY ACORN TRUST           2003-10-20
MULTIGRAPHICS INC                 AM INTERNATIONAL INC          1997-03-17
MILLER LLOYD I III                nan                           NaT
AFFILIATED COMPUTER SERVICES INC  nan                           NaT
ADAMS RESOURCES & ENERGY, INC.    ADAMS RESOURCES & ENERGY INC  2005-04-01
BK Technologies Corp              BK Technologies, Inc.         2019-03-28
I want to figure out what the name of each company was at a particular date. Let's say I want to figure out the name of a company as of January 1st 2002. Then I could create a new column called say, edited_name, which would contain the current name of the company unless the company has changed names since 1/1/2002, in which case it would contain the historical name (i.e. former_name1) of the company. So the output should look something like this:
name                              former_name1                  name_change_date1  edited_name
ACMAT CORP                        nan                           NaT                ACMAT CORP
ACME ELECTRIC CORP                nan                           NaT                ACME ELECTRIC CORP
ACME UNITED CORP                  nan                           NaT                ACME UNITED CORP
COLUMBIA ACORN TRUST              LIBERTY ACORN TRUST           2003-10-20         LIBERTY ACORN TRUST
MULTIGRAPHICS INC                 AM INTERNATIONAL INC          1997-03-17         MULTIGRAPHICS INC
MILLER LLOYD I III                nan                           NaT                MILLER LLOYD I III
AFFILIATED COMPUTER SERVICES INC  nan                           NaT                AFFILIATED COMPUTER SERVICES INC
ADAMS RESOURCES & ENERGY, INC.    ADAMS RESOURCES & ENERGY INC  2005-04-01         ADAMS RESOURCES & ENERGY INC
BK Technologies Corp              BK Technologies, Inc.         2019-03-28         BK Technologies, Inc.
In Stata (with which I am much more familiar) this could be easily accomplished with:
gen edited_name = name
replace edited_name = former_name1 if name_change_date_1 > date("2002-01-01", "YMD") & name_change_date_1 != .
Unfortunately I am at a loss of how to accomplish this in Python/Pandas.
Data:
{'name': ['ACMAT CORP', 'ACME ELECTRIC CORP', 'ACME UNITED CORP', 'COLUMBIA ACORN TRUST',
'MULTIGRAPHICS INC', 'MILLER LLOYD I III', 'AFFILIATED COMPUTER SERVICES INC',
'ADAMS RESOURCES & ENERGY, INC.', 'BK Technologies Corp'],
'former_name1': [nan, nan, nan, 'LIBERTY ACORN TRUST', 'AM INTERNATIONAL INC', nan, nan,
'ADAMS RESOURCES & ENERGY INC', 'BK Technologies, Inc.'],
'name_change_date1': [NaT, NaT, NaT, '2003-10-20', '1997-03-17', NaT, NaT,
'2005-04-01', '2019-03-28']}
You could use numpy.where to select values depending on whether a name change occurred after the cutoff date (name_change_date1 must be a datetime column, for example after converting it with pd.to_datetime):
import numpy as np
import pandas as pd
df['edited_name'] = np.where(df['name_change_date1'].notna() &
                             df['name_change_date1'].gt(pd.to_datetime('1/1/2002')),
                             df['former_name1'], df['name'])
or with mask:
df['edited_name'] = df['name'].mask(df['name_change_date1'].notna() &
                                    df['name_change_date1'].gt(pd.to_datetime('1/1/2002')),
                                    df['former_name1'])
Output:
name former_name1 \
0 ACMAT CORP NaN
1 ACME ELECTRIC CORP NaN
2 ACME UNITED CORP NaN
3 COLUMBIA ACORN TRUST LIBERTY ACORN TRUST
4 MULTIGRAPHICS INC AM INTERNATIONAL INC
5 MILLER LLOYD I III NaN
6 AFFILIATED COMPUTER SERVICES INC NaN
7 ADAMS RESOURCES & ENERGY, INC. ADAMS RESOURCES & ENERGY INC
8 BK Technologies Corp BK Technologies, Inc.
name_change_date1 edited_name
0 NaT ACMAT CORP
1 NaT ACME ELECTRIC CORP
2 NaT ACME UNITED CORP
3 2003-10-20 LIBERTY ACORN TRUST
4 1997-03-17 MULTIGRAPHICS INC
5 NaT MILLER LLOYD I III
6 NaT AFFILIATED COMPUTER SERVICES INC
7 2005-04-01 ADAMS RESOURCES & ENERGY INC
8 2019-03-28 BK Technologies, Inc.
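For reference, a self-contained sketch of the numpy.where approach on three of the rows from the question. It includes the pd.to_datetime conversion, which is an assumption about how the date strings in the posted data would be loaded; the names cutoff and changed_after are made up for illustration.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "name": ["ACMAT CORP", "COLUMBIA ACORN TRUST", "MULTIGRAPHICS INC"],
    "former_name1": [np.nan, "LIBERTY ACORN TRUST", "AM INTERNATIONAL INC"],
    "name_change_date1": [None, "2003-10-20", "1997-03-17"],
})

# Convert the date strings to datetime64; missing values become NaT.
df["name_change_date1"] = pd.to_datetime(df["name_change_date1"])

cutoff = pd.Timestamp("2002-01-01")

# Comparisons against NaT evaluate to False, so rows without a name change
# fall through to the current name.
changed_after = df["name_change_date1"] > cutoff

df["edited_name"] = np.where(changed_after, df["former_name1"], df["name"])
print(df)
```

This mirrors the Stata logic: start from the current name and swap in the former name only where the change date is later than the cutoff.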
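Alternatively, build the result step by step with boolean indexing:
import numpy as np
import pandas as pd
df = pd.DataFrame({'name':['a', 'b', 'c', 'd'], 'fname':[np.nan, 'h', 's', np.nan], 'dc':[np.nan, '2003-10-20', '1997-03-17', np.nan]})
df['dc'] = pd.to_datetime(df['dc'])
# former name, but only where the change happened after the cutoff date
df['nname'] = df['fname'][df['dc']>'1/1/2002']
res = df['name'][df['nname'].isna()]
temp = df['fname'][df['nname'].notna()]
# Series.append was removed in pandas 2.0; pd.concat is the replacement
res = pd.concat([res, temp])
df['res']=res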
output:

Remove special chars and related texts from Dataframe

My dataframe is cluttered with special chars and some company extensions that I'm trying to get rid of.
---
df
--
Microsoft inc
google INC
Apple Pvt Ltd
orc~l PvT ltd
Am##zon Pvt Ltd
Expected output
--
df
--
Microsoft
google
Apple
oracl
Amazon
What I tried:
word_list= ['inc','INC','Pvt Ltd', 'PvT ltd']
df1= ''.join([repl if idx in word_list else idx for idx in df])
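One possible direction, as a rough sketch only: strip the company extensions with a case-insensitive regex anchored at the end of the string, then drop any remaining non-alphanumeric characters. The column name company and the suffixes list are assumptions for illustration. Note that removing characters cannot restore missing letters, so this yields "orcl" and "Amzon" rather than the "oracl" and "Amazon" shown in the expected output.

```python
import re
import pandas as pd

# Hypothetical column name; the question only shows the values.
df = pd.DataFrame({"company": ["Microsoft inc", "google INC", "Apple Pvt Ltd",
                               "orc~l PvT ltd", "Am##zon Pvt Ltd"]})

# Extensions to strip, written once in lowercase; matching ignores case.
suffixes = ["inc", "pvt ltd"]
suffix_pattern = r"\s+(?:" + "|".join(re.escape(s) for s in suffixes) + r")\s*$"

cleaned = (
    df["company"]
    .str.replace(suffix_pattern, "", flags=re.I, regex=True)  # drop trailing extension
    .str.replace(r"[^A-Za-z0-9 ]+", "", regex=True)           # drop special characters
    .str.strip()
)
print(cleaned)
```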

Inserting a header row for pandas dataframe

I have just started python and am trying to rewrite one of my perl scripts in python. Essentially, I had a long script to convert a csv to json.
I've tried to import my csv into a pandas dataframe, and wanted to insert a header row at the top, since my csv lacks that.
Code:
import pandas
db=pandas.read_csv("netmedsdb.csv",header=None)
db
Output:
0 1 2 3
0 3M CAVILON NO STING BARRIER FILM SPRAY 28ML OTC 0 Rs.880.00 3M INDIA LTD
1 BACTI BAR SOAP 75GM OTC Rs.98.00 6TH SKIN PHARMACEUTICALS PVT LTD
2 KWIKNIC MINT FLAVOUR 4MG CHEW GUM TABLET 30'S NICOTINE Rs.180.00 A S V LABORATORIES INDIA PVT LTD
3 RIFAGO 550MG TABLET 10'S RIFAXIMIN 550MG Rs.298.00 AAREEN HEALTHCARE
4 999 OIL 60ML AYURVEDIC MEDICINE Rs.120.00 AAKASH PHARMACEUTICALS
5 AKASH SOAP 75GM AYURVEDIC PRODUCT Rs.80.00 AAKASH PHARMACEUTICALS
6 GROW CARE OIL 100ML AYURVEDIC MEDICINE Rs.190.00 AAKASH PHARMACEUTICALS
7 GROW CARE OIL 100ML AYURVEDIC MEDICINE Rs.190.00 AAKASH PHARMACEUTICALS
8 RHUNS OIL 30ML AYURVEDIC Rs.50.00 AAKASH PHARMACEUTICALS
9 VILLO CAPSULE 10'S AYURVEDIC MEDICINE Rs.70.00 AAKASH PHARMACEUTICALS
10 VITAWIN FORTE CAPSULE 10'S AYURVEDIC MEDICINE Rs.150.00 AAKASH PHARMACEUTICALS
I wrote the following code to insert the first element at row 0, column 0:
db.insert(loc=0,column='0',value='Brand')
db
Output:
0 0 1 2 3
0 Brand 3M CAVILON NO STING BARRIER FILM SPRAY 28ML OTC 0 Rs.880.00 3M INDIA LTD
1 Brand BACTI BAR SOAP 75GM OTC Rs.98.00 6TH SKIN PHARMACEUTICALS PVT LTD
2 Brand KWIKNIC MINT FLAVOUR 4MG CHEW GUM TABLET 30'S NICOTINE Rs.180.00 A S V LABORATORIES INDIA PVT LTD
3 Brand RIFAGO 550MG TABLET 10'S RIFAXIMIN 550MG Rs.298.00 AAREEN HEALTHCARE
4 Brand 999 OIL 60ML AYURVEDIC MEDICINE Rs.120.00 AAKASH PHARMACEUTICALS
5 Brand AKASH SOAP 75GM AYURVEDIC PRODUCT Rs.80.00 AAKASH PHARMACEUTICALS
6 Brand GROW CARE OIL 100ML AYURVEDIC MEDICINE Rs.190.00 AAKASH PHARMACEUTICALS
7 Brand GROW CARE OIL 100ML AYURVEDIC MEDICINE Rs.190.00 AAKASH PHARMACEUTICALS
8 Brand RHUNS OIL 30ML AYURVEDIC Rs.50.00 AAKASH PHARMACEUTICALS
9 Brand VILLO CAPSULE 10'S AYURVEDIC MEDICINE Rs.70.00 AAKASH PHARMACEUTICALS
10 Brand VITAWIN FORTE CAPSULE 10'S AYURVEDIC MEDICINE Rs.150.00 AAKASH PHARMACEUTICALS
But unfortunately I got the word "Brand" inserted at column 0 in all rows.
I'm trying to add the header columns "Brand", "Generic", "Price", "Company"
You only need the names parameter of read_csv:
import pandas as pd
from io import StringIO
temp=u"""a,b,10,d
e,f,45,r
"""
#after testing, replace 'StringIO(temp)' with 'netmedsdb.csv'
df = pd.read_csv(StringIO(temp), names=["Brand", "Generic", "Price", "Company"])
print (df)
Brand Generic Price Company
0 a b 10 d
1 e f 45 r
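If the file has already been loaded with header=None, as in the question, the column labels can also be assigned afterwards. A minimal sketch, using an in-memory string as a stand-in for netmedsdb.csv:

```python
from io import StringIO
import pandas as pd

# Stand-in for the real file; swap StringIO(csv_text) for "netmedsdb.csv".
csv_text = "a,b,10,d\ne,f,45,r\n"

db = pd.read_csv(StringIO(csv_text), header=None)

# Overwrite the default integer labels 0..3 with meaningful names.
db.columns = ["Brand", "Generic", "Price", "Company"]
print(db)
```

Unlike DataFrame.insert, which adds a new column filled with the given value, assigning to db.columns only relabels the existing columns.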
