Inserting a header row for pandas dataframe - python

I have just started python and am trying to rewrite one of my perl scripts in python. Essentially, I had a long script to convert a csv to json.
I've tried to import my csv into a pandas dataframe, and wanted to insert a header row at the top, since my csv lacks that.
Code:
import pandas
db=pandas.read_csv("netmedsdb.csv",header=None)
db
Output:
0 1 2 3
0 3M CAVILON NO STING BARRIER FILM SPRAY 28ML OTC 0 Rs.880.00 3M INDIA LTD
1 BACTI BAR SOAP 75GM OTC Rs.98.00 6TH SKIN PHARMACEUTICALS PVT LTD
2 KWIKNIC MINT FLAVOUR 4MG CHEW GUM TABLET 30'S NICOTINE Rs.180.00 A S V LABORATORIES INDIA PVT LTD
3 RIFAGO 550MG TABLET 10'S RIFAXIMIN 550MG Rs.298.00 AAREEN HEALTHCARE
4 999 OIL 60ML AYURVEDIC MEDICINE Rs.120.00 AAKASH PHARMACEUTICALS
5 AKASH SOAP 75GM AYURVEDIC PRODUCT Rs.80.00 AAKASH PHARMACEUTICALS
6 GROW CARE OIL 100ML AYURVEDIC MEDICINE Rs.190.00 AAKASH PHARMACEUTICALS
7 GROW CARE OIL 100ML AYURVEDIC MEDICINE Rs.190.00 AAKASH PHARMACEUTICALS
8 RHUNS OIL 30ML AYURVEDIC Rs.50.00 AAKASH PHARMACEUTICALS
9 VILLO CAPSULE 10'S AYURVEDIC MEDICINE Rs.70.00 AAKASH PHARMACEUTICALS
10 VITAWIN FORTE CAPSULE 10'S AYURVEDIC MEDICINE Rs.150.00 AAKASH PHARMACEUTICALS
I wrote the following code to insert the first element at row 0, column 0:
db.insert(loc=0,column='0',value='Brand')
db
Output:
0 0 1 2 3
0 Brand 3M CAVILON NO STING BARRIER FILM SPRAY 28ML OTC 0 Rs.880.00 3M INDIA LTD
1 Brand BACTI BAR SOAP 75GM OTC Rs.98.00 6TH SKIN PHARMACEUTICALS PVT LTD
2 Brand KWIKNIC MINT FLAVOUR 4MG CHEW GUM TABLET 30'S NICOTINE Rs.180.00 A S V LABORATORIES INDIA PVT LTD
3 Brand RIFAGO 550MG TABLET 10'S RIFAXIMIN 550MG Rs.298.00 AAREEN HEALTHCARE
4 Brand 999 OIL 60ML AYURVEDIC MEDICINE Rs.120.00 AAKASH PHARMACEUTICALS
5 Brand AKASH SOAP 75GM AYURVEDIC PRODUCT Rs.80.00 AAKASH PHARMACEUTICALS
6 Brand GROW CARE OIL 100ML AYURVEDIC MEDICINE Rs.190.00 AAKASH PHARMACEUTICALS
7 Brand GROW CARE OIL 100ML AYURVEDIC MEDICINE Rs.190.00 AAKASH PHARMACEUTICALS
8 Brand RHUNS OIL 30ML AYURVEDIC Rs.50.00 AAKASH PHARMACEUTICALS
9 Brand VILLO CAPSULE 10'S AYURVEDIC MEDICINE Rs.70.00 AAKASH PHARMACEUTICALS
10 Brand VITAWIN FORTE CAPSULE 10'S AYURVEDIC MEDICINE Rs.150.00 AAKASH PHARMACEUTICALS
Unfortunately, this inserted the word "Brand" into a new column 0 in every row.
I'm trying to add the header columns "Brand", "Generic", "Price", "Company".

You only need the names parameter of read_csv:
import pandas as pd
from io import StringIO
temp = u"""a,b,10,d
e,f,45,r
"""
# after testing, replace StringIO(temp) with 'netmedsdb.csv'
# (pd.compat.StringIO was removed in pandas 1.0, so io.StringIO is used here)
df = pd.read_csv(StringIO(temp), names=["Brand", "Generic", "Price", "Company"])
print(df)
Brand Generic Price Company
0 a b 10 d
1 e f 45 r
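If the file has already been read with header=None, the existing columns can also be relabelled afterwards. The sketch below assumes two made-up sample rows in place of netmedsdb.csv. It also shows why DataFrame.insert was not the right tool: insert adds a new column, whereas assigning to columns only renames the ones that are already there.

```python
import io
import pandas as pd

# Two sample rows standing in for netmedsdb.csv (hypothetical data).
sample = "a,b,10,d\ne,f,45,r\n"

# header=None gives the default integer column labels 0..3.
db = pd.read_csv(io.StringIO(sample), header=None)

# Relabel the existing columns; no column or row is added.
db.columns = ["Brand", "Generic", "Price", "Company"]
print(db)
```

Passing names= to read_csv, as in the answer above, gets the same labels in one step.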

Related

Check if there is any 12-character word available in the string. If available, extract the word

I have been looking to extract only a 12-character word from the string if it exists.
Need to check if first 5 characters are from a given list and check last 3 character are numbers.
Input data (Data.xlsx):
Description Number
CHQ -AQBCN222Q546 from India Federation Pvt Ltd
CHQN#DJBNK220Q329 from Indiana Basics Software Ltd -BC003
CASH- NJRQC225J987^ from US Fertilizers LLP
CHQ - from India Bulls Pvt Ltd
AQBCN222Q989 from India Bulls Pvt Ltd
CHQ -AQCCN222Q546 from India Federation Pvt Ltd
CASH - AQBCN222Q546289 from India Federation Pvt Ltd
list_Character - ['AQBCN','PUCNQ','DJBNK','ADJBC','NJRQC']
Expected output:
Description Number
CHQ -AQBCN222Q546 from India Federation Pvt Ltd AQBCN222Q546
CHQN#DJBNK220Q329 from Indiana Basics Software Ltd -BC003 DJBNK220Q329
CASH- NJRQC225J987^ from US Fertilizers LLP NJRQC225J987
CHQ - from India Bulls Pvt Ltd
AQBCN222Q989 from India Bulls Pvt Ltd AQBCN222Q989
CHQ -AQCCN222Q546 from India Federation Pvt Ltd
CASH - AQBCN222Q546289 from India Federation Pvt Ltd
Code:
import pandas as pd
import re
df = pd.read_excel(r'D:/Users/Data.xlsx')
list_Character = ['AQBCN','PUCNQ','DJBNK','ADJBC','NJRQC']
regex = r'[#-]((?:' + r'|'.join(list_Character) + r')\w{5})\b'
df["Number"] = df["Description"].str.extract(regex)
This does not give me the expected result.
I tried adapting the answer from "Check if there is any 10 character word available in the string If Exist Extract the word", but it did not work.
You can slightly modify the regex to remove the leading character match and match 7 extra characters:
list_Character = ['AQBCN','PUCNQ','DJBNK','ADJBC','NJRQC']
regex = r'((?:' + r'|'.join(list_Character) + r')\w{7})\b'
df["Number"] = df["Description"].str.extract(regex)
Output:
Description Number
0 CHQ -AQBCN222Q546 from India Federation Pvt Ltd AQBCN222Q546
1 CHQN#DJBNK220Q329 from Indiana Basics Software... DJBNK220Q329
2 CASH- NJRQC225J987^ from US Fertilizers LLP NJRQC225J987
3 CHQ - from India Bulls Pvt Ltd NaN
4 AQBCN222Q989 from India Bulls Pvt Ltd AQBCN222Q989
5 CHQ -AQCCN222Q546 from India Federation Pvt Ltd NaN
6 CASH - AQBCN222Q546289 from India Federation P... NaN
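The regex above accepts any 7 word characters after the prefix, so it does not itself enforce the "last 3 characters are numbers" rule from the question, and it would also match a code glued to the end of a longer word. A stricter variant could look like the sketch below. The lookarounds and the assumption that the 4 middle characters are alphanumeric are my additions, not part of the original answer.

```python
import re
import pandas as pd

list_Character = ['AQBCN', 'PUCNQ', 'DJBNK', 'ADJBC', 'NJRQC']
df = pd.DataFrame({"Description": [
    "CHQ -AQBCN222Q546 from India Federation Pvt Ltd",
    "CHQN#DJBNK220Q329 from Indiana Basics Software Ltd -BC003",
    "CASH- NJRQC225J987^ from US Fertilizers LLP",
    "CHQ - from India Bulls Pvt Ltd",
    "AQBCN222Q989 from India Bulls Pvt Ltd",
    "CHQ -AQCCN222Q546 from India Federation Pvt Ltd",
    "CASH - AQBCN222Q546289 from India Federation Pvt Ltd",
]})

prefixes = "|".join(map(re.escape, list_Character))
# (?<![A-Za-z0-9]) : the code must not be preceded by a letter or digit
# [A-Za-z0-9]{4}   : four characters after the 5-letter prefix
# \d{3}            : the last three characters must be digits
# (?![A-Za-z0-9])  : nothing alphanumeric may follow, so longer words are rejected
regex = rf"(?<![A-Za-z0-9])((?:{prefixes})[A-Za-z0-9]{{4}}\d{{3}})(?![A-Za-z0-9])"

# expand=False returns a Series because the pattern has a single capture group.
df["Number"] = df["Description"].str.extract(regex, expand=False)
print(df)
```

On the sample rows this yields the same matches as the simpler pattern; the difference only shows up on inputs such as a 12-character word that ends in letters.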

Change the value of a dataframe column to the value of a second column conditional on the value of a third column in pandas

I have data with current names of companies, old names, and the date of name changes. It looks like this:
name
former_name1
name_change_date1
ACMAT CORP
nan
NaT
ACME ELECTRIC CORP
nan
NaT
ACME UNITED CORP
nan
NaT
COLUMBIA ACORN TRUST
LIBERTY ACORN TRUST
2003-10-20
MULTIGRAPHICS INC
AM INTERNATIONAL INC
1997-03-17
MILLER LLOYD I III
nan
NaT
AFFILIATED COMPUTER SERVICES INC
nan
NaT
ADAMS RESOURCES & ENERGY, INC.
ADAMS RESOURCES & ENERGY INC
2005-04-01
BK Technologies Corp
BK Technologies, Inc.
2019-03-28
I want to figure out what the name of each company was at a particular date. Let's say I want to figure out the name of a company as of January 1st 2002. Then I could create a new column called say, edited_name, which would contain the current name of the company unless the company has changed names since 1/1/2002, in which case it would contain the historical name (i.e. former_name1) of the company. So the output should look something like this:
name
former_name1
name_change_date1
edited_name
ACMAT CORP
nan
NaT
ACMAT CORP
ACME ELECTRIC CORP
nan
NaT
ACME ELECTRIC CORP
ACME UNITED CORP
nan
NaT
ACME UNITED CORP
COLUMBIA ACORN TRUST
LIBERTY ACORN TRUST
2003-10-20
LIBERTY ACORN TRUST
MULTIGRAPHICS INC
AM INTERNATIONAL INC
1997-03-17
MULTIGRAPHICS INC
MILLER LLOYD I III
nan
NaT
MILLER LLOYD I III
AFFILIATED COMPUTER SERVICES INC
nan
NaT
AFFILIATED COMPUTER SERVICES INC
ADAMS RESOURCES & ENERGY, INC.
ADAMS RESOURCES & ENERGY INC
2005-04-01
ADAMS RESOURCES & ENERGY INC
BK Technologies Corp
BK Technologies, Inc.
2019-03-28
BK Technologies, Inc.
In Stata (with which I am much more familiar) this could be easily accomplished with:
gen edited_name = name
replace edited_name = former_name1 if name_change_date_1 > date("2002-01-01", "YMD") & name_change_date_1 != .
Unfortunately I am at a loss of how to accomplish this in Python/Pandas.
Data:
{'name': ['ACMAT CORP', 'ACME ELECTRIC CORP', 'ACME UNITED CORP', 'COLUMBIA ACORN TRUST',
'MULTIGRAPHICS INC', 'MILLER LLOYD I III', 'AFFILIATED COMPUTER SERVICES INC',
'ADAMS RESOURCES & ENERGY, INC.', 'BK Technologies Corp'],
'former_name1': [nan, nan, nan, 'LIBERTY ACORN TRUST', 'AM INTERNATIONAL INC', nan, nan,
'ADAMS RESOURCES & ENERGY INC', 'BK Technologies, Inc.'],
'name_change_date1': [NaT, NaT, NaT, '2003-10-20', '1997-03-17', NaT, NaT,
'2005-04-01', '2019-03-28']}
You could use numpy.where to select values depending on if a name change occurred or not:
import numpy as np
df['edited_name'] = np.where(df['name_change_date1'].notna() &
df['name_change_date1'].gt(pd.to_datetime('1/1/2002')),
df['former_name1'], df['name'])
or with mask:
df['edited_name'] = df['name'].mask(df['name_change_date1'].notna() &
df['name_change_date1'].gt(pd.to_datetime('1/1/2002')),
df['former_name1'])
Output:
name former_name1 \
0 ACMAT CORP NaN
1 ACME ELECTRIC CORP NaN
2 ACME UNITED CORP NaN
3 COLUMBIA ACORN TRUST LIBERTY ACORN TRUST
4 MULTIGRAPHICS INC AM INTERNATIONAL INC
5 MILLER LLOYD I III NaN
6 AFFILIATED COMPUTER SERVICES INC NaN
7 ADAMS RESOURCES & ENERGY, INC. ADAMS RESOURCES & ENERGY INC
8 BK Technologies Corp BK Technologies, Inc.
name_change_date1 edited_name
0 NaT ACMAT CORP
1 NaT ACME ELECTRIC CORP
2 NaT ACME UNITED CORP
3 2003-10-20 LIBERTY ACORN TRUST
4 1997-03-17 MULTIGRAPHICS INC
5 NaT MILLER LLOYD I III
6 NaT AFFILIATED COMPUTER SERVICES INC
7 2005-04-01 ADAMS RESOURCES & ENERGY INC
8 2019-03-28 BK Technologies, Inc.
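Both variants assume that name_change_date1 is already a datetime column. A self-contained sketch, using three rows taken from the question and an explicit pd.to_datetime conversion, might look like this. The variable names cutoff and changed_after are only illustrative.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "name": ["COLUMBIA ACORN TRUST", "MULTIGRAPHICS INC", "ACMAT CORP"],
    "former_name1": ["LIBERTY ACORN TRUST", "AM INTERNATIONAL INC", np.nan],
    "name_change_date1": ["2003-10-20", "1997-03-17", None],
})

# Convert the date strings to datetime64; missing values become NaT.
df["name_change_date1"] = pd.to_datetime(df["name_change_date1"])

cutoff = pd.Timestamp("2002-01-01")
# Comparisons against NaT evaluate to False, so rows without a name change
# keep their current name.
changed_after = df["name_change_date1"] > cutoff

df["edited_name"] = np.where(changed_after, df["former_name1"], df["name"])
print(df)
```

Because NaT never compares as greater than the cutoff, the explicit notna() check in the answer is a safeguard rather than a requirement.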
Use:
import numpy as np
import pandas as pd
df = pd.DataFrame({'name':['a', 'b', 'c', 'd'], 'fname':[np.nan, 'h', 's', np.nan], 'dc':[np.nan, '2003-10-20', '1997-03-17', np.nan]})
df['dc'] = pd.to_datetime(df['dc'])
# keep the former name only where the change happened after the cutoff date
df['nname'] = df['fname'][df['dc']>'1/1/2002']
res = df['name'][df['nname'].isna()]
temp = df['fname'][df['nname'].notna()]
# Series.append was removed in pandas 2.0; pd.concat does the same job
res = pd.concat([res, temp])
df['res']=res
output:

How to get a list of tickers in Jupyter Notebook?

Write code to get a list of tickers for all S&P 500 stocks from Wikipedia. As of 2/24/2021, there are 505 tickers in that list. You can use any method you want as long as the code actually queries the following website to get the list:
https://en.wikipedia.org/wiki/List_of_S%26P_500_companies
One way would be to use the requests module to get the HTML code and then use the re module to extract the tickers. Another option would be the .read_html function in pandas and then export the tickers column to a list.
You should save the tickers in a list with the name sp500_tickers
This will grab the data in the table named 'constituents'.
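# find a specific table by its position on the page
from io import StringIO
import pandas as pd
import requests
from bs4 import BeautifulSoup
res = requests.get("https://en.wikipedia.org/wiki/List_of_S%26P_500_companies")
soup = BeautifulSoup(res.content,'lxml')
table = soup.find_all('table')[0]
# read_html returns a list of DataFrames; recent pandas versions expect a
# file-like object rather than a raw HTML string, hence StringIO
df = pd.read_html(StringIO(str(table)))
print(df[0].to_json(orient='records'))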
Result:
[{"Symbol":"MMM","Security":"3M Company","SEC filings":"reports","GICS Sector":"Industrials","GICS Sub-Industry":"Industrial Conglomerates","Headquarters Location":"St. Paul, Minnesota","Date first added":"1976-08-09","CIK":66740,"Founded":"1902"},{"Symbol":"ABT","Security":"Abbott Laboratories","SEC filings":"reports","GICS Sector":"Health Care","GICS Sub-Industry":"Health Care Equipment","Headquarters Location":"North Chicago, Illinois","Date first added":"1964-03-31","CIK":1800,"Founded":"1888"},{"Symbol":"ABBV","Security":"AbbVie Inc.","SEC filings":"reports","GICS Sector":"Health Care","GICS Sub-Industry":"Pharmaceuticals","Headquarters Location":"North Chicago, Illinois","Date first added":"2012-12-31","CIK":1551152,"Founded":"2013 (1888)"},{"Symbol":"ABMD","Security":"Abiomed","SEC filings":"reports","GICS Sector":"Health Care","GICS Sub-Industry":"Health Care Equipment","Headquarters Location":"Danvers, Massachusetts","Date first added":"2018-05-31","CIK":815094,"Founded":"1981"},{"Symbol":"ACN","Security":"Accenture","SEC filings":"reports","GICS Sector":"Information Technology","GICS Sub-Industry":"IT Consulting & Other Services","Headquarters Location":"Dublin, Ireland","Date first added":"2011-07-06","CIK":1467373,"Founded":"1989"},{"Symbol":"ATVI","Security":"Activision Blizzard","SEC filings":"reports","GICS Sector":"Communication Services","GICS Sub-Industry":"Interactive Home Entertainment","Headquarters Location":"Santa Monica, California","Date first added":"2015-08-31","CIK":718877,"Founded":"2008"},{"Symbol":"ADBE","Security":"Adobe Inc.","SEC filings":"reports","GICS Sector":"Information Technology","GICS Sub-Industry":"Application Software","Headquarters Location":"San Jose, California","Date first added":"1997-05-05","CIK":796343,"Founded":"1982"},
Etc., etc., etc.
That's JSON. If you want a table, kind of like what you would use in Excel, simply print the df.
Result:
[ Symbol Security SEC filings GICS Sector \
0 MMM 3M Company reports Industrials
1 ABT Abbott Laboratories reports Health Care
2 ABBV AbbVie Inc. reports Health Care
3 ABMD Abiomed reports Health Care
4 ACN Accenture reports Information Technology
.. ... ... ... ...
500 YUM Yum! Brands Inc reports Consumer Discretionary
501 ZBRA Zebra Technologies reports Information Technology
502 ZBH Zimmer Biomet reports Health Care
503 ZION Zions Bancorp reports Financials
504 ZTS Zoetis reports Health Care
GICS Sub-Industry Headquarters Location \
0 Industrial Conglomerates St. Paul, Minnesota
1 Health Care Equipment North Chicago, Illinois
2 Pharmaceuticals North Chicago, Illinois
3 Health Care Equipment Danvers, Massachusetts
4 IT Consulting & Other Services Dublin, Ireland
.. ... ...
500 Restaurants Louisville, Kentucky
501 Electronic Equipment & Instruments Lincolnshire, Illinois
502 Health Care Equipment Warsaw, Indiana
503 Regional Banks Salt Lake City, Utah
504 Pharmaceuticals Parsippany, New Jersey
Date first added CIK Founded
0 1976-08-09 66740 1902
1 1964-03-31 1800 1888
2 2012-12-31 1551152 2013 (1888)
3 2018-05-31 815094 1981
4 2011-07-06 1467373 1989
.. ... ... ...
500 1997-10-06 1041061 1997
501 2019-12-23 877212 1969
502 2001-08-07 1136869 1927
503 2001-06-22 109380 1873
504 2013-06-21 1555280 1952
[505 rows x 9 columns]]
Alternatively, you can export the df to a CSV file.
df.to_csv('constituents.csv')
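The question asks for a list named sp500_tickers, which the code above does not build. A minimal sketch of that last step follows. The helper name tickers_from_table is hypothetical, and a three-row stand-in table replaces the scraped one so the example runs offline. It assumes the ticker column is called "Symbol", as in the output shown above.

```python
import pandas as pd

def tickers_from_table(table: pd.DataFrame, column: str = "Symbol") -> list:
    """Return the ticker column of a constituents table as a list of strings."""
    return table[column].astype(str).str.strip().tolist()

# Stand-in for df[0] from the answer above (hypothetical three-row sample).
constituents = pd.DataFrame({
    "Symbol": ["MMM", "ABT", "ABBV"],
    "Security": ["3M Company", "Abbott Laboratories", "AbbVie Inc."],
})

sp500_tickers = tickers_from_table(constituents)
print(sp500_tickers)
```

With the scraped data, the call would be sp500_tickers = tickers_from_table(df[0]), since read_html returns a list of DataFrames.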

Scraping an HTML tag on a web page using BeautifulSoup

I was looking to get the data in the stock dropdown. I went into the source and found the tag but I can't get the code to access the data. Can someone please help me fix the bug?
url ="http://www.moneycontrol.com/india/fnoquote/reliance-industries/RI/2020-07-30"
from bs4 import BeautifulSoup
import pandas as pd
r = requests.get(url, headers = headers)
soup = BeautifulSoup(r.content, "html.parser")
for i in soup.select("stock_id"):
    print(i.text)
You can use the selector #stock_code > option instead of stock_id to get the data in the stock dropdown. Try this:
from bs4 import BeautifulSoup
import requests
url = "http://www.moneycontrol.com/india/fnoquote/reliance-industries/RI/2020-07-30"
headers = {'User-Agent': 'Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)'}
r = requests.get(url, headers=headers)
soup = BeautifulSoup(r.content, "html.parser")
# every <option> that is a direct child of the element with id="stock_code"
a = soup.select("#stock_code > option")
for i in a:
    print(i.text)
Output will be:
ACC
Adani Enterpris
Adani Ports
Adani Power
Ajanta Pharma
Allahabad Bank
Amara Raja Batt
Ambuja Cements
Apollo Hospital
Apollo Tyres
Arvind
Ashok Leyland
Asian Paints
Aurobindo Pharm
Axis Bank
Bajaj Auto
Bajaj Finance
Bajaj Finserv
Balkrishna Ind
Bank of Baroda
Bank of India
Bata India
BEML
Berger Paints
Bharat Elec
Bharat Fin
Bharat Forge
Bharti Airtel
Bharti Infratel
BHEL
Biocon
Bosch
BPCL
Britannia
Cadila Health
Can Fin Homes
Canara Bank
Capital First
Castrol
Ceat
Century
CESC
CG Power
Chennai Petro
Cholamandalam
Cipla
Coal India
Colgate
Container Corp
Cummins
Dabur India
Dalmia Bharat
DCB Bank
Dewan Housing
Dish TV
Divis Labs
DLF
Dr Reddys Labs
Eicher Motors
EngineersInd
Equitas Holding
Escorts
Exide Ind
Federal Bank
GAIL
Glenmark
GMR Infra
Godfrey Phillip
Godrej Consumer
Godrej Ind
Granules India
Grasim
GSFC
Havells India
HCL Tech
HDFC
HDFC Bank
Hero Motocorp
Hexaware Tech
Hind Constr
Hind Zinc
Hindalco
HPCL
HUL
ICICI Bank
ICICI Prudentia
IDBI Bank
IDFC
IDFC Bank
IFCI
IGL
India Cements
Indiabulls Hsg
Indian Bank
IndusInd Bank
Infibeam Avenue
Infosys
Interglobe Avi
IOC
IRB Infra
ITC
Jain Irrigation
Jaiprakash Asso
Jet Airways
Jindal Steel
JSW Steel
Jubilant Food
Just Dial
Kajaria Ceramic
Karnataka Bank
Kaveri Seed
Kotak Mahindra
KPIT Tech
L&T Finance
Larsen
LIC Housing Fin
Lupin
M&M
M&M Financial
Mahanagar Gas
Manappuram Fin
Marico
Maruti Suzuki
Max Financial
MCX India
Mindtree
Motherson Sumi
MRF
MRPL
Muthoot Finance
NALCO
NBCC (India)
NCC
Nestle
NHPC
NIIT Tech
NMDC
NTPC
Oil India
ONGC
Oracle Fin Serv
Oriental Bank
Page Industries
PC Jeweller
Petronet LNG
Pidilite Ind
Piramal Enter
PNB
Power Finance
Power Grid Corp
PTC India
PVR
Ramco Cements
Raymond
RBL Bank
REC
Rel Capital
Reliance
Reliance Comm
Reliance Infra
Reliance Power
Repco Home
SAIL
SBI
Shree Cements
Shriram Trans
Siemens
South Ind Bk
SREI Infra
SRF
Strides Pharma
Sun Pharma
Sun TV Network
Suzlon Energy
Syndicate Bank
Tata Chemicals
Tata Comm
Tata Elxsi
Tata Global Bev
Tata Motors
Tata Motors (D)
Tata Power
Tata Steel
TCS
Tech Mahindra
Titan Company
Torrent Pharma
Torrent Power
TV18 Broadcast
TVS Motor
Ujjivan Financi
UltraTechCement
Union Bank
United Brewerie
United Spirits
UPL
V-Guard Ind
Vedanta
Vodafone Idea
Voltas
Wipro
Wockhardt
Yes Bank
Zee Entertain
Select
ACC
Adani Enterpris
Adani Ports
Adani Power
Ajanta Pharma
Allahabad Bank
Amara Raja Batt
Ambuja Cements
Apollo Hospital
Apollo Tyres
Arvind
Ashok Leyland
Asian Paints
Aurobindo Pharm
Axis Bank
Bajaj Auto
Bajaj Finance
Bajaj Finserv
Balkrishna Ind
Bank of Baroda
Bank of India
Bata India
BEML
Berger Paints
Bharat Elec
Bharat Fin
Bharat Forge
Bharti Airtel
Bharti Infratel
BHEL
Biocon
Bosch
BPCL
Britannia
Cadila Health
Can Fin Homes
Canara Bank
Capital First
Castrol
Ceat
Century
CESC
CG Power
Chennai Petro
Cholamandalam
Cipla
Coal India
Colgate
Container Corp
Cummins
Dabur India
Dalmia Bharat
DCB Bank
Dewan Housing
Dish TV
Divis Labs
DLF
Dr Reddys Labs
Eicher Motors
EngineersInd
Equitas Holding
Escorts
Exide Ind
Federal Bank
GAIL
Glenmark
GMR Infra
Godfrey Phillip
Godrej Consumer
Godrej Ind
Granules India
Grasim
GSFC
Havells India
HCL Tech
HDFC
HDFC Bank
Hero Motocorp
Hexaware Tech
Hind Constr
Hind Zinc
Hindalco
HPCL
HUL
ICICI Bank
ICICI Prudentia
IDBI Bank
IDFC
IDFC Bank
IFCI
IGL
India Cements
Indiabulls Hsg
Indian Bank
IndusInd Bank
Infibeam Avenue
Infosys
Interglobe Avi
IOC
IRB Infra
ITC
Jain Irrigation
Jaiprakash Asso
Jet Airways
Jindal Steel
JSW Steel
Jubilant Food
Just Dial
Kajaria Ceramic
Karnataka Bank
Kaveri Seed
Kotak Mahindra
KPIT Tech
L&T Finance
Larsen
LIC Housing Fin
Lupin
M&M
M&M Financial
Mahanagar Gas
Manappuram Fin
Marico
Maruti Suzuki
Max Financial
MCX India
Mindtree
Motherson Sumi
MRF
MRPL
Muthoot Finance
NALCO
NBCC (India)
NCC
Nestle
NHPC
NIIT Tech
NMDC
NTPC
Oil India
ONGC
Oracle Fin Serv
Oriental Bank
Page Industries
PC Jeweller
Petronet LNG
Pidilite Ind
Piramal Enter
PNB
Power Finance
Power Grid Corp
PTC India
PVR
Ramco Cements
Raymond
RBL Bank
REC
Rel Capital
Reliance
Reliance Comm
Reliance Infra
Reliance Power
Repco Home
SAIL
SBI
Shree Cements
Shriram Trans
Siemens
South Ind Bk
SREI Infra
SRF
Strides Pharma
Sun Pharma
Sun TV Network
Suzlon Energy
Syndicate Bank
Tata Chemicals
Tata Comm
Tata Elxsi
Tata Global Bev
Tata Motors
Tata Motors (D)
Tata Power
Tata Steel
TCS
Tech Mahindra
Titan Company
Torrent Pharma
Torrent Power
TV18 Broadcast
TVS Motor
Ujjivan Financi
UltraTechCement
Union Bank
United Brewerie
United Spirits
UPL
V-Guard Ind
Vedanta
Vodafone Idea
Voltas
Wipro
Wockhardt
Yes Bank
Zee Entertain
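The reason the original attempt returned nothing is that a bare name in a CSS selector matches a tag name, so select("stock_id") looks for <stock_id> elements. The sketch below shows the difference on a small made-up snippet. The markup and option values are invented for illustration and are not copied from the moneycontrol page.

```python
from bs4 import BeautifulSoup

# Hypothetical markup, only loosely modelled on a stock dropdown.
html = """
<select id="stock_code">
  <option value="">Select</option>
  <option value="ACC06">ACC</option>
  <option value="RI">Reliance</option>
</select>
"""
soup = BeautifulSoup(html, "html.parser")

# A bare word is a tag selector: this searches for <stock_id> tags.
by_tag = soup.select("stock_id")

# "#stock_code" selects by id; "> option" takes its direct <option> children.
options = soup.select("#stock_code > option")

# Skip the placeholder entry, which has an empty value attribute here.
names = [o.get_text(strip=True) for o in options if o.get("value")]
print(len(by_tag), names)
```

Filtering on the value attribute is one way to drop a "Select" placeholder, assuming the real page marks it with an empty value.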

Python3 Pandas Dataframe Split Columns

I have a column on my dataframe that contains the following
Wal-Mart Stores, Inc., Clinton, IA 52732
Benton Packing, LLC, Clearfield, UT 84016
North Coast Iron Corp, Seattle, WA 98109
Messer Construction Co. Inc., Amarillo, TX 79109
Ocean Spray Cranberries, Inc., Henderson, NV 89011
W R Derrick & Co. Lexington, SC 29072
I am having trouble capturing this with a regex. So far my regex only works for the first 2 lines:
[A-Z][A-za-z-\s]+,\s{1}(Inc.|LLC)
How do I split the column into 4 additional columns, i.e. Column 1 = Company Name, Column 2 = City, Column 3 = State, Column 4 = Zipcode?
Example of the output is shown below:
Company_Name City State ZipCode
Wal-Mart Stores, Inc. Clinton IA 52732
The names are probably the trickiest part, but if you know that the structure of city, state, zip will always be the same (i.e. no extra commas) you could use rsplit to split the strings. Similarly pandas has a str.rsplit method as well.
df
Address
0 Wal-Mart Stores, Inc., Clinton, IA 52732
1 Benton Packing, LLC, Clearfield, UT 84016
2 North Coast Iron Corp, Seattle, WA 98109
3 Messer Construction Co. Inc., Amarillo, TX 79109
df['Zip'] = df.Address.map(lambda x: x.rsplit(' ', 1)[-1])
df['Name'], df['City'], df['State']= zip(*df.Address.map(lambda x: x.rsplit(' ', 1)[0].rsplit(',', 2)))
df
Address Zip \
0 Wal-Mart Stores, Inc., Clinton, IA 52732 52732
1 Benton Packing, LLC, Clearfield, UT 84016 84016
2 North Coast Iron Corp, Seattle, WA 98109 98109
3 Messer Construction Co. Inc., Amarillo, TX 79109 79109
Name City State
0 Wal-Mart Stores, Inc. Clinton IA
1 Benton Packing, LLC Clearfield UT
2 North Coast Iron Corp Seattle WA
3 Messer Construction Co. Inc. Amarillo TX
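A vectorised alternative is str.extract with named groups, sketched below. The pattern is my own and assumes every address ends in ", City, ST 12345" with a two-letter state and a five-digit zip. Rows that break that layout, such as the "W R Derrick & Co. Lexington, SC 29072" line from the question (no comma before the city), come back as NaN and need separate handling.

```python
import pandas as pd

df = pd.DataFrame({"Address": [
    "Wal-Mart Stores, Inc., Clinton, IA 52732",
    "Benton Packing, LLC, Clearfield, UT 84016",
    "North Coast Iron Corp, Seattle, WA 98109",
    "W R Derrick & Co. Lexington, SC 29072",
]})

# The greedy .* takes as much of the name as possible, commas included,
# while still leaving ", City, ST 12345" at the end of the string.
pattern = (
    r"^(?P<Company_Name>.*),\s*"
    r"(?P<City>[^,]+),\s*"
    r"(?P<State>[A-Z]{2})\s+"
    r"(?P<ZipCode>\d{5})$"
)

# Named groups become column names; rows that do not match are all NaN.
parts = df["Address"].str.extract(pattern)
df = df.join(parts)
print(df)
```

The zip code stays a string here, which keeps any leading zeros intact.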
