I have a column on my dataframe that contains the following
Wal-Mart Stores, Inc., Clinton, IA 52732
Benton Packing, LLC, Clearfield, UT 84016
North Coast Iron Corp, Seattle, WA 98109
Messer Construction Co. Inc., Amarillo, TX 79109
Ocean Spray Cranberries, Inc., Henderson, NV 89011
W R Derrick & Co. Lexington, SC 29072
I am having trouble capturing it with a regex. So far my regex only works for the first 2 lines:
[A-Z][A-za-z-\s]+,\s{1}(Inc.|LLC)
How do I split the column into 4 additional columns, i.e. Column 1 = Company Name, Column 2 = City, Column 3 = State, Column 4 = Zipcode?
Example of the output is shown below:
Company_Name | City | State | ZipCode
Wal-Mart Stores, Inc. | Clinton | IA | 52732
The names are probably the trickiest part, but if you know that the structure of city, state, zip will always be the same (i.e. no extra commas) you could use rsplit to split the strings. Similarly pandas has a str.rsplit method as well.
df
Address
0 Wal-Mart Stores, Inc., Clinton, IA 52732
1 Benton Packing, LLC, Clearfield, UT 84016
2 North Coast Iron Corp, Seattle, WA 98109
3 Messer Construction Co. Inc., Amarillo, TX 79109
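# Zip code: everything after the last space
df['Zip'] = df.Address.map(lambda x: x.rsplit(' ', 1)[-1])
# Drop the zip, split on the last two commas, and strip the leftover spaces
df['Name'], df['City'], df['State'] = zip(*df.Address.map(lambda x: [p.strip() for p in x.rsplit(' ', 1)[0].rsplit(',', 2)]))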
df
Address Zip \
0 Wal-Mart Stores, Inc., Clinton, IA 52732 52732
1 Benton Packing, LLC, Clearfield, UT 84016 84016
2 North Coast Iron Corp, Seattle, WA 98109 98109
3 Messer Construction Co. Inc., Amarillo, TX 79109 79109
Name City State
0 Wal-Mart Stores, Inc. Clinton IA
1 Benton Packing, LLC Clearfield UT
2 North Coast Iron Corp Seattle WA
3 Messer Construction Co. Inc. Amarillo TX
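The rsplit steps above can also be folded into a single regular expression. The following is only a sketch, assuming every address ends in ", City, ST 12345" with a two-letter state and a five-digit zip; the pattern and the name `parts` are illustrative and not part of the original answer.

```python
import pandas as pd

df = pd.DataFrame({"Address": [
    "Wal-Mart Stores, Inc., Clinton, IA 52732",
    "Benton Packing, LLC, Clearfield, UT 84016",
    "North Coast Iron Corp, Seattle, WA 98109",
    "Messer Construction Co. Inc., Amarillo, TX 79109",
]})

# Assumed layout: "<name>, <city>, <2-letter state> <5-digit zip>".
# The greedy Name group keeps any commas that belong to the company name.
pattern = r"^(?P<Name>.*),\s*(?P<City>[^,]+),\s*(?P<State>[A-Z]{2})\s+(?P<Zip>\d{5})$"

# One column per named group; rows that do not match become NaN.
parts = df["Address"].str.extract(pattern)
df = df.join(parts)
print(df[["Name", "City", "State", "Zip"]])
```

A row such as "W R Derrick & Co. Lexington, SC 29072", which has no comma between the name and the city, does not match this pattern and would come back as NaN in all four columns, so rows like that need separate handling.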
I have a DataFrame that I need to filter into a new DataFrame, work on, and then write the changes back to the original. The original looks like this:
Street | City | State | Zip
4210 Nw Lake Dr | Lees Summit | Mo | 64064
9810 Scripps Lake Dr. Suite A San Diego | Ca - 92131 | |
1124 Ethel St | Glendale | Ca | 91207
4000 E Bristol St Ste 3 Elkhart | In-46514 | |
My intended output is:
Street | City | State | Zip
4210 Nw Lake Dr | Lees Summit | Mo | 64064
9810 Scripps Lake Dr. Suite A San Diego | | Ca | 92131
1124 Ethel St | Glendale | Ca | 91207
4000 E Bristol St Ste 3 Elkhart | | In | 46514
So first I filtered the original DataFrame into a new one and worked on it with the following code:
Filter3_df= Final[Final['State'].isnull()]
Filter3_df['temp'] = Filter3_df['City'].str.extract('([A-Za-z]+)')
mask2= Filter3_df['temp'].notnull()
Filter3_df.loc[mask2, 'Zip'] = Filter3_df.loc[mask2, 'City'].str[-5:]
Filter3_df.loc[mask2, 'State'] = Filter3_df.loc[mask2, 'temp']
del Filter3_df['temp']
Filter3_df['City']= float('NaN')
After this, Filter3_df looks like this:
Street | City | State | Zip
9810 Scripps Lake Dr. Suite A San Diego | | Ca | 92131
4000 E Bristol St Ste 3 Elkhart | | In | 46514
But when I write this filtered DataFrame back to the original using
Final.update(Filter3_df)
I don't get the intended output. Instead I get:
Street | City | State | Zip
4210 Nw Lake Dr | Lees Summit | Mo | 64064
9810 Scripps Lake Dr. Suite A San Diego | Ca - 92131 | Ca | 92131
1124 Ethel St | Glendale | Ca | 91207
4000 E Bristol St Ste 3 Elkhart | In-46514 | In | 46514
Kindly let me know where I am going wrong.
From the docs, pandas.DataFrame.update:
Modify in place using non-NA values from another DataFrame.
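Because update skips NA values in the other DataFrame, the NaN you assigned to City never overwrites the original value. Replace Filter3_df['City'] = float('NaN') with the non-NA value you actually want, for example an empty string: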
Filter3_df['City'] = ""
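The behaviour can be reproduced on a tiny example. This is a minimal sketch with made-up frames (`final` and `patch` are hypothetical stand-ins for Final and Filter3_df), showing that a NaN in the patch leaves the original cell alone while an empty string replaces it.

```python
import numpy as np
import pandas as pd

# Stand-in for the original frame: row 1 has state and zip stuck in City.
final = pd.DataFrame({
    "City": ["Lees Summit", "Ca - 92131"],
    "State": ["Mo", None],
    "Zip": ["64064", None],
})

# Stand-in for the filtered and repaired frame; it shares index label 1.
patch = pd.DataFrame(
    {"City": [np.nan], "State": ["Ca"], "Zip": ["92131"]},
    index=[1],
)

kept = final.copy()
kept.update(patch)                    # NaN in patch is skipped, City is untouched

cleared = final.copy()
cleared.update(patch.assign(City=""))  # "" is not NA, so City is overwritten

print(kept)
print(cleared)
```

State and Zip are filled in both cases; only the City column differs, which matches the output described in the question.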
I have 2 CSV files. file1 has a list of research group names; file2 has the full research group names along with their locations. I want to join the two files wherever the names match.
In file1.csv:
research_groups_names
Chinese Academy of Sciences (CAS)
CAS
U-M
UQ
In file2.csv:
research_groups_names | Location
Chinese Academy of Sciences (CAS) | China
University of Michigan (U-M) | United States of America (USA)
The University of Queensland (UQ) | Australia
The desired output.csv:
f1_research_groups_names | f2_research_groups_names | Location
Chinese Academy of Sciences (CAS) | Chinese Academy of Sciences (CAS) | China
CAS | Chinese Academy of Sciences (CAS) | China
U-M | University of Michigan (U-M) | United States of America (USA)
UQ | The University of Queensland (UQ) | Australia
import pandas as pd
df1 = pd.read_csv('file1.csv')
df2 = pd.read_csv('file2.csv')
df1 = df1.add_prefix('f1_')
df2 = df2.add_prefix('f2_')
def compare_nae(df):
    if df1['f1_research_groups_names'] == df2['f2_research_groups_names']:
        return 1
    else:
        return 0
result = pd.merge(df1, df2, left_on=['f1_research_groups_names'],right_on=['f2_research_groups_names'], how="left")
result.to_csv('output.csv')
You can try:
def fn(row):
    for _, n in df2.iterrows():
        if (
            n["research_groups_names"] == row["research_groups_names"]
            or row["research_groups_names"] in n["research_groups_names"]
        ):
            return n
df1[["f2_research_groups_names", "f2_location"]] = df1.apply(fn, axis=1)
df1 = df1.rename(columns={"research_groups_names": "f1_research_groups_names"})
print(df1)
Prints:
f1_research_groups_names f2_research_groups_names f2_location
0 Chinese Academy of Sciences (CAS) Chinese Academy of Sciences (CAS) China
1 CAS Chinese Academy of Sciences (CAS) China
2 U-M University of Michigan (U-M) United States of America (USA)
3 UQ The University of Queensland (UQ) Australia
Note: If a name in df1 is not found in df2, the "f2_research_groups_names" and "f2_location" columns will contain None for that row.
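The same substring matching can also be written without a row-wise apply. The sketch below is an alternative, not the answer's method: it assumes pandas 1.2 or later (for how="cross") and builds the two frames inline instead of reading the CSV files. Unlike fn above, it keeps every match rather than only the first one and drops names that match nothing.

```python
import pandas as pd

df1 = pd.DataFrame({"research_groups_names": [
    "Chinese Academy of Sciences (CAS)", "CAS", "U-M", "UQ",
]})
df2 = pd.DataFrame({
    "research_groups_names": [
        "Chinese Academy of Sciences (CAS)",
        "University of Michigan (U-M)",
        "The University of Queensland (UQ)",
    ],
    "Location": ["China", "United States of America (USA)", "Australia"],
})

# Pair every file1 name with every file2 name (4 x 3 = 12 candidate rows).
pairs = df1.add_prefix("f1_").merge(df2.add_prefix("f2_"), how="cross")

# Keep the pairs where the file1 name occurs inside the file2 name.
is_match = [
    short in full
    for short, full in zip(pairs["f1_research_groups_names"],
                           pairs["f2_research_groups_names"])
]
result = (pairs[is_match]
          .rename(columns={"f2_Location": "Location"})
          .reset_index(drop=True))
print(result)
```

The cross join grows with the product of the two row counts, so this form is only reasonable for small lookup tables like the ones in the question.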
I have data with current names of companies, old names, and the date of name changes. It looks like this:
name | former_name1 | name_change_date1
ACMAT CORP | nan | NaT
ACME ELECTRIC CORP | nan | NaT
ACME UNITED CORP | nan | NaT
COLUMBIA ACORN TRUST | LIBERTY ACORN TRUST | 2003-10-20
MULTIGRAPHICS INC | AM INTERNATIONAL INC | 1997-03-17
MILLER LLOYD I III | nan | NaT
AFFILIATED COMPUTER SERVICES INC | nan | NaT
ADAMS RESOURCES & ENERGY, INC. | ADAMS RESOURCES & ENERGY INC | 2005-04-01
BK Technologies Corp | BK Technologies, Inc. | 2019-03-28
I want to figure out what the name of each company was at a particular date. Let's say I want to figure out the name of a company as of January 1st 2002. Then I could create a new column called say, edited_name, which would contain the current name of the company unless the company has changed names since 1/1/2002, in which case it would contain the historical name (i.e. former_name1) of the company. So the output should look something like this:
name | former_name1 | name_change_date1 | edited_name
ACMAT CORP | nan | NaT | ACMAT CORP
ACME ELECTRIC CORP | nan | NaT | ACME ELECTRIC CORP
ACME UNITED CORP | nan | NaT | ACME UNITED CORP
COLUMBIA ACORN TRUST | LIBERTY ACORN TRUST | 2003-10-20 | LIBERTY ACORN TRUST
MULTIGRAPHICS INC | AM INTERNATIONAL INC | 1997-03-17 | MULTIGRAPHICS INC
MILLER LLOYD I III | nan | NaT | MILLER LLOYD I III
AFFILIATED COMPUTER SERVICES INC | nan | NaT | AFFILIATED COMPUTER SERVICES INC
ADAMS RESOURCES & ENERGY, INC. | ADAMS RESOURCES & ENERGY INC | 2005-04-01 | ADAMS RESOURCES & ENERGY INC
BK Technologies Corp | BK Technologies, Inc. | 2019-03-28 | BK Technologies, Inc.
In Stata (with which I am much more familiar) this could be easily accomplished with:
gen edited_name = name
replace edited_name = former_name1 if name_change_date1 > date("2002-01-01", "YMD") & name_change_date1 != .
Unfortunately I am at a loss as to how to accomplish this in Python/Pandas.
Data:
{'name': ['ACMAT CORP', 'ACME ELECTRIC CORP', 'ACME UNITED CORP', 'COLUMBIA ACORN TRUST',
'MULTIGRAPHICS INC', 'MILLER LLOYD I III', 'AFFILIATED COMPUTER SERVICES INC',
'ADAMS RESOURCES & ENERGY, INC.', 'BK Technologies Corp'],
'former_name1': [nan, nan, nan, 'LIBERTY ACORN TRUST', 'AM INTERNATIONAL INC', nan, nan,
'ADAMS RESOURCES & ENERGY INC', 'BK Technologies, Inc.'],
'name_change_date1': [NaT, NaT, NaT, '2003-10-20', '1997-03-17', NaT, NaT,
'2005-04-01', '2019-03-28']}
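You could use numpy.where to select values depending on whether a name change occurred after the cutoff date (this assumes name_change_date1 is already a datetime column; convert it with pd.to_datetime if it holds strings):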
import numpy as np
import pandas as pd

df['edited_name'] = np.where(df['name_change_date1'].notna() &
                             df['name_change_date1'].gt(pd.to_datetime('1/1/2002')),
                             df['former_name1'], df['name'])
or with mask:
df['edited_name'] = df['name'].mask(df['name_change_date1'].notna() &
                                    df['name_change_date1'].gt(pd.to_datetime('1/1/2002')),
                                    df['former_name1'])
Output:
name former_name1 \
0 ACMAT CORP NaN
1 ACME ELECTRIC CORP NaN
2 ACME UNITED CORP NaN
3 COLUMBIA ACORN TRUST LIBERTY ACORN TRUST
4 MULTIGRAPHICS INC AM INTERNATIONAL INC
5 MILLER LLOYD I III NaN
6 AFFILIATED COMPUTER SERVICES INC NaN
7 ADAMS RESOURCES & ENERGY, INC. ADAMS RESOURCES & ENERGY INC
8 BK Technologies Corp BK Technologies, Inc.
name_change_date1 edited_name
0 NaT ACMAT CORP
1 NaT ACME ELECTRIC CORP
2 NaT ACME UNITED CORP
3 2003-10-20 LIBERTY ACORN TRUST
4 1997-03-17 MULTIGRAPHICS INC
5 NaT MILLER LLOYD I III
6 NaT AFFILIATED COMPUTER SERVICES INC
7 2005-04-01 ADAMS RESOURCES & ENERGY INC
8 2019-03-28 BK Technologies, Inc.
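For reference, here is a self-contained sketch of the same idea on a three-row subset of the data. It assumes the dates arrive as strings, as in the dictionary in the question, and converts them first; `cutoff` and `changed_later` are names introduced here for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    "name": ["COLUMBIA ACORN TRUST", "MULTIGRAPHICS INC", "ACMAT CORP"],
    "former_name1": ["LIBERTY ACORN TRUST", "AM INTERNATIONAL INC", None],
    "name_change_date1": ["2003-10-20", "1997-03-17", None],
})

# Strings become datetime64 values; the missing date becomes NaT.
df["name_change_date1"] = pd.to_datetime(df["name_change_date1"])

cutoff = pd.Timestamp("2002-01-01")

# Comparisons against NaT are False, so companies that never changed
# their name fall through to the current name.
changed_later = df["name_change_date1"] > cutoff

# Keep former_name1 where the change came after the cutoff, else use name.
df["edited_name"] = df["former_name1"].where(changed_later, df["name"])
print(df)
```

This mirrors the Stata snippet in the question: start from name and swap in former_name1 only where the change date is later than the cutoff.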
Use:
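import numpy as np
import pandas as pd

df = pd.DataFrame({'name':['a', 'b', 'c', 'd'], 'fname':[np.nan, 'h', 's', np.nan], 'dc':[np.nan, '2003-10-20', '1997-03-17', np.nan]})
df['dc'] = pd.to_datetime(df['dc'])
# Former name only where the change happened after the cutoff
df['nname'] = df['fname'][df['dc']>'1/1/2002']
# Current name for every other row
res = df['name'][df['nname'].isna()]
temp = df['fname'][df['nname'].notna()]
# Series.append was removed in pandas 2.0; pd.concat does the same job
res = pd.concat([res, temp])
df['res']=res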
output:
I'm learning how to scrape using Beautiful Soup with Selenium, and I found a website that has multiple tables built with table tags (my first time dealing with them). I'm trying to scrape the text from each table and append each element to its respective list. For now I'm only trying to scrape the first table; the rest I want to do on my own. But I cannot access the tag for some reason.
I also use Selenium to access the site, because when I copy the link into another tab, the list of tables disappears.
My code so far:
import requests
from bs4 import BeautifulSoup
import pandas as pd
import re
from selenium import webdriver
from selenium.webdriver.support.ui import Select
PATH = "C:\Program Files (x86)\chromedriver.exe"
driver = webdriver.Chrome(PATH)
targetSite = "https://www.sdvisualarts.net/sdvan_new/events.php"
driver.get(targetSite)
select_event = Select(driver.find_element_by_name('subs'))
select_event.select_by_value('All')
select_loc = Select(driver.find_element_by_name('loc'))
select_loc.select_by_value("All")
driver.find_element_by_name("submit").click()
targetSite = "https://www.sdvisualarts.net/sdvan_new/viewevents.php"
event_title = []
name = []
address = []
city = []
state = []
zipCode = []
location = []
webSite = []
fee = []
event_dates = []
opening_dates = []
description = []
try:
    page = requests.get(targetSite )
    soup = BeautifulSoup(page.text, 'html.parser')
    items = soup.find_all('table', {"class":"popdetail"})
    for i in items:
        event_title.append(item.find('b', {'class': "text"})).text.strip()
        name.append(item.find('td', {'class': "text"})).text.strip()
        address.append(item.find('td', {'class': "text"})).text.strip()
        city.append(item.find('td', {'class': "text"})).text.strip()
        state.append(item.find('td', {'class': "text"})).text.strip()
        zipCode.append(item.find('td', {'class': "text"})).text.strip()
Can someone let me know if I am doing this correctly? This is my first time dealing with a site whose elements disappear when the URL is copied into a new tab or window.
So far, I am unable to append any information to any of the lists.
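One issue is with the for loop.
You have for i in items:, but then you are calling item instead of i. Also, .text.strip() is chained onto the result of append(...), which is always None; it has to go inside the parentheses, on the value returned by find.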
Secondly, if you are using Selenium to render the page, you should probably also use Selenium to get the HTML (driver.page_source) rather than a separate requests call. The page also has tables embedded within tables, so it's not as straightforward as iterating through the <table> tags. What I ended up doing was having pandas read in the tables (pd.read_html returns a list of DataFrames), then iterating through those, as there is a pattern to how the DataFrames are laid out.
import pandas as pd
from selenium import webdriver
from selenium.webdriver.support.ui import Select
PATH = r"C:\Program Files (x86)\chromedriver.exe"  # raw string so the backslashes are taken literally
driver = webdriver.Chrome(PATH)
targetSite = "https://www.sdvisualarts.net/sdvan_new/events.php"
driver.get(targetSite)
select_event = Select(driver.find_element_by_name('subs'))
select_event.select_by_value('All')
select_loc = Select(driver.find_element_by_name('loc'))
select_loc.select_by_value("All")
driver.find_element_by_name("submit").click()
targetSite = "https://www.sdvisualarts.net/sdvan_new/viewevents.php"
event_title = []
name = []
address = []
city = []
state = []
zipCode = []
location = []
webSite = []
fee = []
event_dates = []
opening_dates = []
description = []
dfs = pd.read_html(driver.page_source)
driver.close()
for idx, table in enumerate(dfs):
    if table.iloc[0,0] == 'Event Title':
        event_title.append(table.iloc[-1,0])
        tempA = dfs[idx+1]
        tempA.index = tempA[0]
        tempB = dfs[idx+4]
        tempB.index = tempB[0]
        tempC = dfs[idx+5]
        tempC.index = tempC[0]
        name.append(tempA.loc['Name',1])
        address.append(tempA.loc['Address',1])
        city.append(tempA.loc['City',1])
        state.append(tempA.loc['State',1])
        zipCode.append(tempA.loc['Zip',1])
        location.append(tempA.loc['Location',1])
        webSite.append(tempA.loc['Web Site',1])
        fee.append(tempB.loc['Fee',1])
        event_dates.append(tempB.loc['Dates',1])
        opening_dates.append(tempB.loc['Opening Days',1])
        description.append(tempC.loc['Event Description',1])
df = pd.DataFrame({'event_title':event_title,
'name':name,
'address':address,
'city':city,
'state':state,
'zipCode':zipCode,
'location':location,
'webSite':webSite,
'fee':fee,
'event_dates':event_dates,
'opening_dates':opening_dates,
'description':description})
Output:
print (df.to_string())
event_title name address city state zipCode location webSite fee event_dates opening_dates description
0 The San Diego Museum of Art Welcomes a Special... San Diego Museum of Art 1450 El Prado, Balboa Park San Diego CA 92101 Central San Diego https://www.sdmart.org/ NaN Starts On 6-18-2020 Ends On 1-10-2021 Opens virtually on June 18. The work will beco... The San Diego Museum of Art is launching its f...
1 New Exhibit: Miller Dairy Remembered Lemon Grove Historical Society 3185 Olive Street, Treganza Heritage Park Lemon Grove CA 91945 Central San Diego http://www.lghistorical.org Children 12 and under free and must be accompa... Starts On 6-27-2020 Ends On 12-4-2020 Exhibit on view Saturdays 11 am to 2 pm; close... From 1926 there were cows smack in the midst o...
2 Gizmos and Shivelight Distinction Gallery 317 E. Grand Ave Escondido CA 92025 North County Inland http://www.distinctionart.com NaN Starts On 7-14-2020 Ends On 9-5-2020 08/08/20 - 09/05/20 Distinction Gallery is proud to present our so...
3 Virtual Opening - July Exhibitions Vision Art Museum 2825 Dewey Rd. Suite 100 San Diego CA 92106 Central San Diego http://www.visionsartmuseum.org Free Starts On 7-18-2020 Ends On 10-4-2020 NaN Join Visions Art Museum for a virtual exhibiti...
4 Laying it Bare: The Art of Walter Redondo and ... Fresh Paint Gallery 1020-B Prospect Street La Jolla CA 92037 Central San Diego http://freshpaintgallery.com/ NaN Starts On 8-1-2020 Ends On 9-27-2020 Tuesday through Sunday. Mondays closed. A two-person exhibit of new abstract expressio...
5 Online oil painting lessons with Concetta Antico NaN NaN NaN NaN NaN Virtual http://concettaantico.com/live-online-oil-pain... NaN Starts On 8-10-2020 Ends On 8-31-2020 NaN Anyone can learn to paint like the masters! Ov...
6 MOMENTUM: A Creative Industry Symposium Vanguard Culture Via Zoom San Diego California 92101 Virtual https://www.eventbrite.com/e/momentum-a-creati... $10 suggested donation Starts On 8-17-2020 Ends On 9-7-2020 NaN MOMENTUM: A Creative Industry Symposium Monday...
7 Virtual Locals Invitational Show Art & Frames of Coronado 936 ORANGE AVE Coronado CA 92118 0 https://www.artsteps.com/view/5eed0ad62cd0d65b... free Starts On 8-21-2020 Ends On 8-1-2021 NaN Art and Frames of Coronado invites you to our ...
8 HERE & Now R.B. Stevenson Gallery 7661 Girard Avenue, Suite 101 La Jolla California 92037 Central San Diego http://www.rbstevensongallery.com Free Starts On 8-22-2020 Ends On 9-25-2020 Tuesday through Saturday R.B.Stevenson Gallery is pleased to announce t...
9 Art Unites Learning: Normal 2.0 Art Unites NaN San Diego NaN 92116 Central San Diego https://www.facebook.com/events/956878098104971 Free Starts On 8-25-2020 Ends On 8-25-2020 NaN Please join us on Tuesday, August 25th as we: ...
10 Image Quest Sojourn; Visual Journaling for Per... Pamela Underwood Studios Virtual NaN NaN NaN Virtual http://www.pamelaunderwood.com/event/new-onlin... $595.00 Starts On 8-26-2020 Ends On 11-11-2020 NaN Create a personal Image Quest resource journal...
11 Behind The Exhibition: Southern California Con... Oceanside Museum of Art 704 Pier View Way Oceanside California 92054 Virtual https://oma-online.org/events/behind-the-exhib... No fee required. Donations recommended. Starts On 8-27-2020 Ends On 8-27-2020 NaN Join curator Beth Smith and exhibitions manage...
12 Lay it on Thick, a Virtual Art Exhibition San Diego Watercolor Society 2825 Dewey Rd Bldg #202 San Diego California 92106 0 https://www.sdws.org NaN Starts On 8-30-2020 Ends On 9-26-2020 NaN The San Diego Watercolor Society proudly prese...
13 The Forum: Marketing & Branding for Creatives Vanguard Culture Via Zoom San Diego CA 92101 South San Diego http://vanguardculture.com/ $5 suggested donation Starts On 9-1-2020 Ends On 9-1-2020 NaN Attention creative industry professionals! Joi...
14 Write or Die Solo Exhibition You Belong Here 3619 EL CAJON BLVD San Diego CA 92104 Central San Diego http://www.youbelongsd.com/upcoming-events/wri... $10 donation to benefit You Belong Here Starts On 9-4-2020 Ends On 9-6-2020 NaN Write or Die is an immersive installation and ...
15 SDVAN presents Art San Diego at Bread and Salt San Diego Visual Arts Network 1955 Julian Avenue San Digo CA 92113 Central San Diego http://www.sdvisualarts.net and https://www.br... Free Starts On 9-5-2020 Ends On 10-24-2020 NaN We are pleased to announce the four artist rec...
16 The Coming of Treganza Heritage Park Lemon Grove Historical Society 3185 Olive Street Lemon Grove CA 91945 Central San Diego http://www.lghistorical.org Free for all ages Starts On 9-10-2020 Ends On 9-10-2020 The park is open daily, 8 am to 8 pm. Covid 19... Lemon Grove\'s central city park will be renam...
17 Online oil painting course | 4 weeks NaN NaN NaN NaN NaN Virtual http://concettaantico.com/live-online-oil-pain... NaN Starts On 9-14-2020 Ends On 10-5-2020 NaN Over 4 weekly Zoom lessons, learn the techniqu...
18 Online oil painting course | 4 weeks NaN NaN NaN NaN NaN Virtual http://concettaantico.com/live-online-oil-pain... NaN Starts On 10-12-2020 Ends On 11-2-2020 NaN Over 4 weekly Zoom lessons, learn the techniqu...
19 36th Annual Mission Fed ArtWalk Mission Fed ArtWalk Ash Street San Diego California 92101 Central San Diego www.missionfedartwalk.org Free Starts On 11-7-2020 Ends On 11-8-2020 Sat and Sun Nov 7 and 8 Mission Fed ArtWalk returns to San Diego’s Lit...
20 Mingei Pop Up Workshop: My Daruma Doll New Childrens Museum 200 West Island Avenue San Diego California 92101 Central San Diego http://thinkplaycreate.org/ Free with admission Starts On 11-13-2020 Ends On 11-13-2020 NaN Join Mingei International Museum at The New Ch...
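The lookup pattern used in the loop above (index each small table by its first column, then read values by label) can be tried without a browser. This is a sketch on two hand-built frames that only imitate the shape of what pd.read_html returned for this page; the frames, the `records` list and the use of Series.get for missing labels are assumptions made for illustration, not part of the original answer.

```python
import pandas as pd

# Stand-ins for two consecutive entries in the list of DataFrames:
# a one-column title table followed by a label/value details table.
dfs = [
    pd.DataFrame([["Event Title"], ["Gizmos and Shivelight"]]),
    pd.DataFrame([["Name", "Distinction Gallery"], ["City", "Escondido"]]),
]

records = []
for idx, table in enumerate(dfs):
    if table.iloc[0, 0] == "Event Title":
        # Turn the label column into the index so values can be read by label.
        details = dfs[idx + 1].set_index(0)[1]
        records.append({
            "event_title": table.iloc[-1, 0],
            "name": details.get("Name"),
            "city": details.get("City"),
            "fee": details.get("Fee"),  # label absent here, so .get returns None
        })

events = pd.DataFrame(records)
print(events)
```

Collecting one dict per event keeps the fields of a row together, and .get avoids a KeyError when a label is missing from a particular table.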
I've got a list of addresses in a single column address, how would I go about parsing the phone number and restaurant category into new columns? My dataframe looks like this
address
0 Arnie Morton's of Chicago 435 S. La Cienega Blvd. Los Angeles 310-246-1501 Steakhouses
1 Art's Deli 12224 Ventura Blvd. Studio City 818-762-1221 Delis
2 Bel-Air Hotel 701 Stone Canyon Rd. Bel Air 310-472-1211 French Bistro
where I want to get
address | phone_number | category
0 Arnie Morton's of Chicago 435 S. La Cienega Blvd. Los Angeles | 310-246-1501 | Steakhouses
1 Art's Deli 12224 Ventura Blvd. Studio City | 818-762-1221 | Delis
2 Bel-Air Hotel 701 Stone Canyon Rd. Bel Air | 310-472-1211 | French Bistro
Does anybody have any suggestions?
Try using Regex with str.extract.
Ex:
import pandas as pd

df = pd.DataFrame({'address':["Arnie Morton's of Chicago 435 S. La Cienega Blvd. Los Angeles 310-246-1501 Steakhouses",
                              "Art's Deli 12224 Ventura Blvd. Studio City 818-762-1221 Delis",
                              "Bel-Air Hotel 701 Stone Canyon Rd. Bel Air 310-472-1211 French Bistro"]})
# Named groups become the column names; \s* keeps the surrounding spaces out of the captures
df[["address", "phone_number", "category"]] = df["address"].str.extract(r"(?P<address>.*?)\s*(?P<phone_number>\b\d{3}-\d{3}-\d{4}\b)\s*(?P<category>.*)$")
print(df)
Output:
address phone_number \
0 Arnie Morton's of Chicago 435 S. La Cienega Bl... 310-246-1501
1 Art's Deli 12224 Ventura Blvd. Studio City 818-762-1221
2 Bel-Air Hotel 701 Stone Canyon Rd. Bel Air 310-472-1211
category
0 Steakhouses
1 Delis
2 French Bistro
Note: This assumes the content of address is always address, phone number, category, in that order.
Using str.extract and str.split:
We extract the pattern digits-dash-digits-dash-digits for phone_number.
We split on a whitespace character that is preceded by three digits and take the last piece for category. This uses a positive lookbehind, written (?<=...) in regex.
df['phone_number'] = df['address'].str.extract(r'(\d+-\d+-\d+)', expand=False)
df['category'] = df['address'].str.split(r'(?<=\d{3})\s').str[-1]
Output
address phone_number category
0 Arnie Morton's of Chicago 435 S. La Cienega Blvd. Los Angeles 310-246-1501 Steakhouses 310-246-1501 Steakhouses
1 Art's Deli 12224 Ventura Blvd. Studio City 818-762-1221 Delis 818-762-1221 Delis
2 Bel-Air Hotel 701 Stone Canyon Rd. Bel Air 310-472-1211 French Bistro 310-472-1211 French Bistro
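Both answers rely on the phone number being the only NNN-NNN-NNNN token in the string. Under that same assumption, the three parts can also be obtained in one pass with the standard library's re.split, which returns the text around a captured separator. The helper name split_address is hypothetical and this is only a sketch.

```python
import re

# Assumes exactly one phone number of the form 310-246-1501 per string.
PHONE = re.compile(r"\s*(\d{3}-\d{3}-\d{4})\s*")


def split_address(text):
    """Return (address, phone_number, category); phone and category are None if no phone is found."""
    # With one capture group and maxsplit=1, a successful split yields
    # [text before, captured phone number, text after].
    parts = PHONE.split(text, maxsplit=1)
    if len(parts) != 3:
        return text, None, None
    address, phone_number, category = parts
    return address, phone_number, category


rows = [
    "Arnie Morton's of Chicago 435 S. La Cienega Blvd. Los Angeles 310-246-1501 Steakhouses",
    "Art's Deli 12224 Ventura Blvd. Studio City 818-762-1221 Delis",
    "Bel-Air Hotel 701 Stone Canyon Rd. Bel Air 310-472-1211 French Bistro",
]
for row in rows:
    print(split_address(row))
```

Applied to a DataFrame, the function could be mapped over the address column and the resulting tuples unpacked into three columns, in the same way the zip(*...) idiom is used for tuples elsewhere.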