Scraping Address Information Using Selenium in Python

I am trying to scrape address information from https://www.smartystreets.com/products/single-address-iframe. I have a script that searches for the address given in its parameters. When I look at the website itself, I can see various fields such as Carrier Route.
Using 3301 South Greenfield Rd Gilbert, AZ 85297 as a hypothetical example, when one goes to the page manually, one can see the Carrier Route: R109.
I am having trouble, however, locating the Carrier Route with Selenium so I can scrape it. Does anyone have recommendations for how to find the Carrier Route for any given address?
Starting code:
from selenium import webdriver

driver = webdriver.Chrome('chromedriver')
address = "3301 South Greenfield Rd Gilbert, AZ 85297\n"
url = 'https://www.smartystreets.com/products/single-address-iframe'
driver.get(url)
driver.find_element_by_id("lookup-select-button").click()
driver.find_element_by_id("lookup-select").find_element_by_id("address-freeform").click()
driver.find_element_by_id("freeform-address").send_keys(address)
# Find Carrier Route here

You can use driver.execute_script to provide input for the fields and to click the submission button:
from selenium import webdriver

d = webdriver.Chrome('/path/to/chromedriver')
d.get('https://www.smartystreets.com/products/single-address-iframe')
s = '3301 South Greenfield Rd Gilbert, AZ 85297'
# Split the street line from the "City, ST ZIP" portion
a, a1 = s.split(' Rd ')
a += ' Rd'  # keep the street suffix on the street line
# (j := a1.split()) tokenizes 'Gilbert, AZ 85297'; j[0][:-1] drops the trailing comma (requires Python 3.8+)
route = d.execute_script(f'''
document.querySelector('#address-line1').value = '{a}'
document.querySelector('#city').value = '{(j:=a1.split())[0][:-1]}'
document.querySelector('#state').value = '{j[1]}'
document.querySelector('#zip-code').value = '{j[2]}'
document.querySelector('#submit-request').click()
return document.querySelector('#us-street-metadata li:nth-of-type(2) .answer.col-sm-5.col-xs-3').textContent
''')
Output:
'R109'
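If you prefer to stay within Selenium's own API instead of injecting JavaScript, something along these lines should also work. This is only a sketch: the #us-street-metadata container and the .answer class come from the selector above, while the explicit wait and the label check are assumptions:
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

# Wait (up to 10 seconds) for the metadata panel to appear after submitting
panel = WebDriverWait(d, 10).until(
    EC.presence_of_element_located((By.ID, 'us-street-metadata')))
# Scan the result rows for the one labelled "Carrier Route"
for li in panel.find_elements(By.CSS_SELECTOR, 'li'):
    if 'Carrier Route' in li.text:
        print(li.find_element(By.CSS_SELECTOR, '.answer').text)  # e.g. 'R109'
        break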
To get a full display of all the parameter data, you can use BeautifulSoup:
from bs4 import BeautifulSoup as soup

... # selenium driver source here
cols = soup(d.page_source, 'html.parser').select('#us-street-output div')
# Map each section header (h4) to a dict of its label/value pairs,
# stripping the trailing ':' from each label
data = {i.h4.text: {b.select_one('span:nth-of-type(1)').get_text(strip=True)[:-1]:
                    b.select_one('span:nth-of-type(2)').get_text(strip=True)
                    for b in i.select('ul li')}
        for i in cols}
print(data)
print(data['Metadata']['Congressional District'])
Output:
{'Metadata': {'Building Default': 'default', 'Carrier Route': 'R109', 'Congressional District': '05', 'Latitude': '33.291248', 'Longitude': '-111.737427', 'Coordinate Precision': 'Rooftop', 'County Name': 'Maricopa', 'County FIPS': '04013', 'eLOT Sequence': '0160', 'eLOT Sort': 'A', 'Observes DST': 'default', 'RDI': 'Commercial', 'Record Type': 'S', 'Time Zone': 'Mountain', 'ZIP Type': 'Standard'}, 'Analysis': {'Vacant': 'N', 'DPV Match Code': 'Y', 'DPV Footnotes': 'AABB', 'General Footnotes': 'L#', 'CMRA': 'N', 'EWS Match': 'default', 'LACSLink Code': 'default', 'LACSLink Indicator': 'default', 'SuiteLink Match': 'default', 'Enhanced Match': 'default'}, 'Components': {'Urbanization': 'default', 'Primary Number': '3301', 'Street Predirection': 'S', 'Street Name': 'Greenfield', 'Street Postdirection': 'default', 'Street Suffix': 'Rd', 'Secondary Designator': 'default', 'Secondary Number': 'default', 'Extra Secondary Designator': 'default', 'Extra Secondary Number': 'default', 'PMB Designator': 'default', 'PMB Number': 'default', 'City': 'Gilbert', 'Default City Name': 'Gilbert', 'State': 'AZ', 'ZIP Code': '85297', '+4 Code': '2176', 'Delivery Point': '01', 'Check Digit': '2'}}
'05'


Related

Accessing a webpage within a webpage using BeautifulSoup?

I have written a Python script that parses the data of a webpage using BeautifulSoup. What I want to do next is click the NAME of each person on the page, access their profile, then click the website link on that page and scrape the email address (if available) from that website. Can anyone help me out with this? I am new to BeautifulSoup and Python, so I am unable to proceed further. Any help is appreciated.
Thanks!
The kind of link I am working on is:
https://www.realtor.com/realestateagents/agentname-john
Here is my code:
from bs4 import BeautifulSoup
import requests
import csv

# Website URL
w_url = str('https://www.') + str(input('Please Enter Website URL :'))

# Number of pages
pages = int(input(' Please specify number of pages: '))

# Range specified
page_range = list(range(0, pages))

# Website name (in case of multiple websites)
#site_name = int(input('Enter the website name ( IN CAPITALS ) :'))

# Empty list
agent_info = []

# Creating CSV file
csv_file = open(r'D:\Webscraping\real_estate_agents.csv', 'w')
csv_writer = csv.writer(csv_file)
csv_writer.writerow(['Name and Number'])

# For loop
for k in page_range:
    website = requests.get(w_url + '/pg-' + '{}'.format(k)).text
    soup = BeautifulSoup(website, 'lxml')
    class1 = 'jsx-1448471805 agent-name text-bold'
    class2 = 'jsx-1448471805 agent-phone hidden-xs hidden-xxs'
    for i in soup.find_all('div', class_=[class1, class2]):
        w = i.text
        agent_info.append(w)

# Removing duplicates
updated_info = list(dict.fromkeys(agent_info))

# Writing data to CSV
for t in updated_info:
    print(t)
    csv_writer.writerow([t])
    print('\n')

csv_file.close()
It would be more efficient (and fewer lines of code) to grab the data from the API. The website emails appear to be in there too, so if needed there's no reason to visit each of the 30,000+ agent websites for an email; you can get it all in a fraction of the time.
The API also has all the data you'd want or need. For example, here's everything on just one agent:
{'address': {'line': '1101 E 78TH ST STE 300', 'line2': '', 'city': 'BLOOMINGTON', 'postal_code': '55420', 'state_code': 'MN', 'state': '', 'country': 'US'}, 'advertiser_id': 2121274, 'agent_rating': 5, 'background_photo': {'href': 'https://ap.rdcpix.com/1223152681/cc48579b6a0fe6ccbbf44d83e8f82145g-c0o.jpg'}, 'broker': {'fulfillment_id': 3860509, 'designations': [], 'name': 'BRIDGE REALTY, LLC.', 'accent_color': '', 'photo': {'href': ''}, 'video': ''}, 'description': 'As a professional real estate agent licensed in the State of Minnesota, I am committed to providing only the highest standard of care as I assist you in navigating the twists and turns of home ownership. Whether you are buying or selling your home, I will do everything it takes to turn your real estate goals and desires into a reality. If you are looking for a real estate Agent who will put your needs first and go above and beyond to help you reach your goals, I am the agent for you.', 'designations': [], 'first_month': 0, 'first_name': 'John', 'first_year': 2010, 'has_photo': True, 'href': 'http://www.twincityhomes4sale.com', 'id': '56b63efd7e54f7010021459d', 'is_realtor': True, 'languages': [], 'last_name': 'Palomino', 'last_updated': 'Mon, 04 Jan 2021 18:46:12 GMT', 'marketing_area_cities': [{'city_state': 'Columbus_MN', 'name': 'Columbus', 'state_code': 'MN'}, {'city_state': 'Blaine_MN', 'name': 'Blaine', 'state_code': 'MN'}, {'city_state': 'Circle Pines_MN', 'name': 'Circle Pines', 'state_code': 'MN'}, {'city_state': 'Lino Lakes_MN', 'name': 'Lino Lakes', 'state_code': 'MN'}, {'city_state': 'Lexington_MN', 'name': 'Lexington', 'state_code': 'MN'}, {'city_state': 'Forest Lake_MN', 'name': 'Forest Lake', 'state_code': 'MN'}, {'city_state': 'Chisago City_MN', 'name': 'Chisago City', 'state_code': 'MN'}, {'city_state': 'Wyoming_MN', 'name': 'Wyoming', 'state_code': 'MN'}, {'city_state': 'Centerville_MN', 'name': 'Centerville', 'state_code': 'MN'}, {'city_state': 'Hugo_MN', 'name': 'Hugo', 'state_code': 'MN'}, {'city_state': 'Grant_MN', 'name': 'Grant', 'state_code': 'MN'}, {'city_state': 'St. Anthony_MN', 'name': 'St. Anthony', 'state_code': 'MN'}, {'city_state': 'Arden Hills_MN', 'name': 'Arden Hills', 'state_code': 'MN'}, {'city_state': 'New Brighton_MN', 'name': 'New Brighton', 'state_code': 'MN'}, {'city_state': 'Mounds View_MN', 'name': 'Mounds View', 'state_code': 'MN'}, {'city_state': 'White Bear Township_MN', 'name': 'White Bear Township', 'state_code': 'MN'}, {'city_state': 'Vadnais Heights_MN', 'name': 'Vadnais Heights', 'state_code': 'MN'}, {'city_state': 'Shoreview_MN', 'name': 'Shoreview', 'state_code': 'MN'}, {'city_state': 'Little Canada_MN', 'name': 'Little Canada', 'state_code': 'MN'}, {'city_state': 'Columbia Heights_MN', 'name': 'Columbia Heights', 'state_code': 'MN'}, {'city_state': 'Hilltop_MN', 'name': 'Hilltop', 'state_code': 'MN'}, {'city_state': 'Fridley_MN', 'name': 'Fridley', 'state_code': 'MN'}, {'city_state': 'Linwood_MN', 'name': 'Linwood', 'state_code': 'MN'}, {'city_state': 'East Bethel_MN', 'name': 'East Bethel', 'state_code': 'MN'}, {'city_state': 'Spring Lake Park_MN', 'name': 'Spring Lake Park', 'state_code': 'MN'}, {'city_state': 'North St. Paul_MN', 'name': 'North St. Paul', 'state_code': 'MN'}, {'city_state': 'Maplewood_MN', 'name': 'Maplewood', 'state_code': 'MN'}, {'city_state': 'St. Paul_MN', 'name': 'St. 
Paul', 'state_code': 'MN'}], 'mls': [{'member': {'id': '506004321'}, 'id': 416, 'abbreviation': 'MIMN', 'type': 'A', 'primary': True}], 'nar_only': 1, 'nick_name': '', 'nrds_id': '506004321', 'office': {'name': 'Bridge Realty, Llc', 'mls': [{'member': {'id': '10982'}, 'id': 416, 'abbreviation': 'MIMN', 'type': 'O', 'primary': True}], 'phones': [{'ext': '', 'number': '(952) 368-0021', 'type': 'Home'}], 'phone_list': {'phone_1': {'type': 'Home', 'number': '(952) 368-0021', 'ext': ''}}, 'photo': {'href': ''}, 'slogan': '', 'website': None, 'video': None, 'fulfillment_id': 3027311, 'address': {'line': '1101 E 78TH ST STE 300', 'line2': '', 'city': 'BLOOMINGTON', 'postal_code': '55420', 'state_code': 'MN', 'state': '', 'country': 'US'}, 'email': 'tony#thebridgerealty.com', 'nrds_id': None}, 'party_id': 23115328, 'person_name': 'John Palomino', 'phones': [{'ext': '', 'number': '(763) 458-0788', 'type': 'Mobile'}], 'photo': {'href': 'https://ap.rdcpix.com/900899898/cc48579b6a0fe6ccbbf44d83e8f82145a-c0o.jpg'}, 'recommendations_count': 2, 'review_count': 7, 'role': 'agent', 'served_areas': [{'name': 'Circle Pines', 'state_code': 'MN'}, {'name': 'Forest Lake', 'state_code': 'MN'}, {'name': 'Hugo', 'state_code': 'MN'}, {'name': 'St. Paul', 'state_code': 'MN'}, {'name': 'Minneapolis', 'state_code': 'MN'}, {'name': 'Wyoming', 'state_code': 'MN'}], 'settings': {'share_contacts': False, 'full_access': False, 'recommendations': {'realsatisfied': {'user': 'John-Palomino', 'id': '1073IJk', 'linked': '3d91C', 'updated': '1529551719'}}, 'display_listings': True, 'far_override': True, 'show_stream': True, 'terms_of_use': True, 'has_dotrealtor': False, 'display_sold_listings': True, 'display_price_range': True, 'display_ratings': True, 'loaded_from_sb': True, 'broker_data_feed_opt_out': False, 'unsubscribe': {'autorecs': False, 'recapprove': False, 'account_notify': False}, 'new_feature_popup_closed': {'agent_left_nav_avatar_to_profile': False}}, 'slogan': 'Bridging the gap between buyers & sellers', 'specializations': [{'name': '1st time home buyers'}, {'name': 'Residential Listings'}, {'name': 'Rental/Investment Properties'}, {'name': 'Move Up Buyers'}], 'title': 'Agent', 'types': 'agent', 'user_languages': [], 'web_url': 'https://www.realtor.com/realestateagents/John-Palomino_BLOOMINGTON_MN_2121274_876599394', 'zips': ['55014', '55025', '55038', '55112', '55126', '55421', '55449', '55092', '55434', '55109'], 'email': 'johnpalomino#live.com', 'full_name': 'John Palomino', 'name': 'John Palomino, Agent', 'social_media': {'facebook': {'type': 'facebook', 'href': 'https://www.facebook.com/Johnpalominorealestate'}}, 'for_sale_price': {'count': 1, 'min': 299900, 'max': 299900, 'last_listing_date': '2021-01-29T11:10:24Z'}, 'recently_sold': {'count': 35, 'min': 115000, 'max': 460000, 'last_sold_date': '2020-12-18'}, 'agent_team_details': {'is_team_member': False}}
Code:
import requests
import pandas as pd
import math

# Function to pull the data
def get_agent_info(jsonData, rows):
    agents = jsonData['agents']
    for agent in agents:
        name = agent['person_name']
        if 'email' in agent.keys():
            email = agent['email']
        else:
            email = 'N/A'
        if 'href' in agent.keys():
            website = agent['href']
        else:
            website = 'N/A'
        try:
            office_data = agent['office']
            office_email = office_data['email']
        except:
            office_email = 'N/A'
        row = {'name': name, 'email': email, 'website': website, 'office_email': office_email}
        rows.append(row)
    return rows

rows = []
url = 'https://www.realtor.com/realestateagents/api/v3/search'
headers = {'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/87.0.4280.141 Safari/537.36'}
payload = {'nar_only': '1', 'offset': '', 'limit': '300', 'marketing_area_cities': '_',
           'postal_code': '', 'is_postal_search': 'true', 'name': 'john', 'types': 'agent',
           'sort': 'recent_activity_high', 'far_opt_out': 'false', 'client_id': 'FAR2.0',
           'recommendations_count_min': '', 'agent_rating_min': '', 'languages': '',
           'agent_type': '', 'price_min': '', 'price_max': '', 'designations': '',
           'photo': 'true'}

# Get the 1st page, find how many pages you'll need to go through, and parse the data
jsonData = requests.get(url, headers=headers, params=payload).json()
total_matches = jsonData['matching_rows']
total_pages = math.ceil(total_matches / 300)
rows = get_agent_info(jsonData, rows)
print('Completed: %s of %s' % (1, total_pages))

# Iterate through the next pages
for page in range(1, total_pages):
    payload.update({'offset': page * 300})
    jsonData = requests.get(url, headers=headers, params=payload).json()
    rows = get_agent_info(jsonData, rows)
    print('Completed: %s of %s' % (page + 1, total_pages))

df = pd.DataFrame(rows)
Output: Just the first 10 rows of 30,600
print(df.head(10).to_string())
name email website office_email
0 John Croteau jcrot45@gmail.com https://www.facebook.com/JCtherealtor/ 1worcesterhomes@gmail.com
1 Stephanie St John sstjohn@shorewest.com https://stephaniestjohn.shorewest.com customercare@shorewest.com
2 Johnine Larsen info@realestategals.com http://realestategals.com seattle@northwestrealtors.com
3 Leonard Johnson americandreams@comcast.net http://www.adrhomes.net americandreams@comcast.net
4 John C Fitzgerald john@jcfhomes.com http://www.JCFHomes.com
5 John Vrsansky Jr John@OnTargetRealty.com http://www.OnTargetRealty.com john@ontargetrealty.com
6 John Williams jwilliamsidaho@gmail.com http://www.johnwilliamsidaho.com mpickford@kw.com
7 John Zeiter j.zeiter@ggsir.com info@ggsir.com
8 Mitch Johnson mitchjohnson1316@gmail.com miaroberson@creedrealty.com
9 John Lowe jplowe4@gmail.com http://johnlowegroup.com thedavisgrouponline@gmail.com
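If you also want the results in a CSV like the original script produced, pandas can write the frame directly (the path here is just an example):
df.to_csv(r'D:\Webscraping\real_estate_agents.csv', index=False)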
I have used requests-html (docs) instead of BeautifulSoup, but I still tried to keep it as simple as possible.
I have implemented this for the mentioned website specifically. I am filtering based on other attributes instead of class names, and extracting the agent name from the URL.
I populate the set agentWebsites with the required information in the format (agentName, tuple of agent websites mentioned in their profile).
I populate the set agentEmails with the required information in the format (agentName, tuple of emails mentioned on their websites).
I am not using a dict with agentName as the key and websites/emails as the values, since agentName may not be unique and therefore can't be used as a key.
Extracting emails from websites:
Not all websites have an email mentioned on them; some are dummy sites redirecting elsewhere, and some have a contact form for your details instead of listing theirs.
Handling exceptions:
Some websites are not accessible; these are printed in the output.
Some websites take a long time to render; these are also printed in the output. You can increase the value of the timeout_length global variable. When I tried, some of the websites hitting this error rendered successfully with a timeout of 200.
Any other exception (ConnectionError, etc.) is caught by the last except clause and a message is printed to the output.
Code:
from requests_html import HTMLSession, MaxRetries
from requests.exceptions import ConnectionError
import re
import sys

# Global values to store the links of individual agents, and their websites
agentLinks = set()
agentWebsites = set()
agentEmails = set()
session = HTMLSession()
timeout_length = 10

# URLs used
start_url = "https://www.realtor.com/realestateagents/agentname-john"
base_url = "https://www.realtor.com"

# Regex to match emails from a website
EMAIL_REGEX = r"""(?:[a-z0-9!#$%&'*+/=?^_`{|}~-]+(?:\.[a-z0-9!#$%&'*+/=?^_`{|}~-]+)*|"(?:[\x01-\x08\x0b\x0c\x0e-\x1f\x21\x23-\x5b\x5d-\x7f]|\\[\x01-\x09\x0b\x0c\x0e-\x7f])*")@(?:(?:[a-z0-9](?:[a-z0-9-]*[a-z0-9])?\.)+[a-z0-9](?:[a-z0-9-]*[a-z0-9])?|\[(?:(?:(2(5[0-5]|[0-4][0-9])|1[0-9][0-9]|[1-9]?[0-9]))\.){3}(?:(2(5[0-5]|[0-4][0-9])|1[0-9][0-9]|[1-9]?[0-9])|[a-z0-9-]*[a-z0-9]:(?:[\x01-\x08\x0b\x0c\x0e-\x1f\x21-\x5a\x53-\x7f]|\\[\x01-\x09\x0b\x0c\x0e-\x7f])+)\])"""

# Number of pages to be scraped
no_of_pages = int(input("Enter no of pages to be scraped:\t"))

# Scrape the links of agent profiles, page by page
for page in range(1, no_of_pages + 1):
    r = session.get(start_url + '/pg-' + str(page))
    # get all anchor tags
    agentInfo = r.html.find('a')
    for info in agentInfo:
        # filter only agent profiles and extract links
        if "href" in info.attrs and info.attrs["href"].startswith("/realestateagents/"):
            agentLinks.add(info.attrs["href"])
    print('page', page, 'agents found till now', len(agentLinks))
print('Total agents found till now', len(agentLinks))

# Scrape each agent profile page for the website link
print('---Scraping Website from agent Profile and email from agents Websites---')
agent_count = 0
total_agents = len(agentLinks)
for agentLink in agentLinks:
    emails = set()
    websites = set()
    agentName = agentLink.replace("/realestateagents/", "").split('_')[0].replace('-', ' ').title()
    # print the profile scraping progress
    agent_count += 1
    sys.stdout.write("\rscraping agent{0}'s profile".format(agent_count))
    sys.stdout.flush()
    r = session.get(base_url + agentLink)
    # get all anchor tags
    agentInfo = r.html.find('a')
    for info in agentInfo:
        # filter only the website link and extract it
        if ("href" in info.attrs and "data-linkname" in info.attrs
                and info.attrs["data-linkname"] == "realtors:agent_details:contact_details:website"):
            agentWebsite = info.attrs["href"]
            websites.add(agentWebsite)
    if websites:
        agentWebsites.add((agentName, tuple(websites)))
    # print the email scraping progress
    sys.stdout.write("\rscraping agent{0}'s websites for emails".format(agent_count))
    sys.stdout.flush()
    # scrape EMAILS from the websites
    for website in websites:
        try:
            r = session.get(website)
            r.html.render(timeout=timeout_length)
            for re_match in re.finditer(EMAIL_REGEX, r.html.raw_html.decode()):
                if '/' not in re_match.group():
                    emails.add(re_match.group())
        except ConnectionError:
            print('\rcannot connect to', website)
        except MaxRetries as mr:
            print("\r", mr.message.replace('page.', website), sep='')
        except:
            print("\rUnexpected error for site", website, ":", sys.exc_info()[0])
        finally:
            # print the email scraping progress
            sys.stdout.write("\rscraping agent{0}'s websites for emails".format(agent_count))
            sys.stdout.flush()
    # after scraping all websites, add all emails found
    if emails:
        agentEmails.add((agentName, tuple(emails)))

# agentWebsites is a set of tuples of format (agentName, tuple of website urls)
print("\r\nTotal Agent websites scraped", len(agentWebsites))
print(agentWebsites)
print("\nNo of agents with emails scraped", len(agentEmails))
print(agentEmails)
example output:
Enter no of pages to be scraped: 2
page 1 agents found till now 20
page 2 agents found till now 40
Total agents found till now 40
Scraping Website from agent Profile and email from agents Websites
cannot connect to https://www.david-johnston.kw.com
Unable to render the http://www.reefpointrealestate.com/ Try increasing timeout
cannot connect to http://www.patricia-johnson.com
Unable to render the http://palisadeshomes.com/ Try increasing timeout
Unexpected error for site https://www.jwhomesteam.com : <class 'pyppeteer.errors.NetworkError'>
cannot connect to http://www.stevenjohnson.org
cannot connect to http://www.johnrod.com/
cannot connect to http://www.rodneyjohnson.net
cannot connect to http://john.estatesoflasvegas.com
cannot connect to http://www.teamgoodell.com
cannot connect to http://Hilyardproperties.com
Total Agent websites scraped 32
{('John Mcnamara', ('http://www.ttrsir.com',)),... ('Don Johnson Pc', ('https://www.jwhomesteam.com',))}
No of agents with emails scraped 11
{('John Genovese And Richard Lester', ('connect@mycitycountry.com',)), ... ('John "Dan" Bethel', ('therealtygroupohio@gmail.com', 'danbethelteacher@gmail.com'))}
Note:
We can use r.html.find('a', containing='<text>') for filtering, but it didn't seem to work for me.
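For reference, this is what that filter looks like in use; the anchor text 'John' is just an assumed example:
# keep only <a> elements whose text contains 'John'
links = r.html.find('a', containing='John')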

Get value from data-set field sublist

I have a dataset (that pulls its data from a dict) that I am attempting to clean and republish. Within this data set, there is a field containing a sublist from which I would like to extract specific data.
Here's the data:
[{'id': 'oH58h122Jpv47pqXhL9p_Q', 'alias': 'original-pizza-brooklyn-4', 'name': 'Original Pizza', 'image_url': 'https://s3-media1.fl.yelpcdn.com/bphoto/HVT0Vr_Vh52R_niODyPzCQ/o.jpg', 'is_closed': False, 'url': 'https://www.yelp.com/biz/original-pizza-brooklyn-4?adjust_creative=IelPnWlrTpzPtN2YRie19A&utm_campaign=yelp_api_v3&utm_medium=api_v3_business_search&utm_source=IelPnWlrTpzPtN2YRie19A', 'review_count': 102, 'categories': [{'alias': 'pizza', 'title': 'Pizza'}], 'rating': 4.0, 'coordinates': {'latitude': 40.63781, 'longitude': -73.8963799}, 'transactions': [], 'price': '$', 'location': {'address1': '9514 Ave L', 'address2': '', 'address3': '', 'city': 'Brooklyn', 'zip_code': '11236', 'country': 'US', 'state': 'NY', 'display_address': ['9514 Ave L', 'Brooklyn, NY 11236']}, 'phone': '+17185313559', 'display_phone': '(718) 531-3559', 'distance': 319.98144420799355},
Here's how the data is presented within the csv/spreadsheet:
location
{'address1': '9514 Ave L', 'address2': '', 'address3': '', 'city': 'Brooklyn', 'zip_code': '11236', 'country': 'US', 'state': 'NY', 'display_address': ['9514 Ave L', 'Brooklyn, NY 11236']}
Is there a way to pull location.city for example?
The below code simply adds a few fields and exports it to a csv.
def data_set(data):
    df = pd.DataFrame(data)
    df['zip'] = get_zip()
    df['region'] = get_region()
    newdf = df.filter(['name', 'phone', 'location', 'zip', 'region', 'coordinates', 'rating', 'review_count',
                       'categories', 'url'], axis=1)
    if not os.path.isfile('yelp_data.csv'):
        newdf.to_csv('data.csv', header='column_names')
    else:  # else it exists, so append without writing the header
        newdf.to_csv('data.csv', mode='a', header=False)
If that doesn't make sense, please let me know. Thanks in advance!
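One approach, assuming each entry in the location column is still a dict: pull the sub-field out with apply, or let pandas flatten the nesting for you. A sketch:
import pandas as pd

df = pd.DataFrame(data)

# Option 1: extract one sub-field from the dicts
df['city'] = df['location'].apply(lambda loc: loc.get('city'))

# Option 2: flatten every nested field into dotted columns such as 'location.city'
flat = pd.json_normalize(data)
print(flat['location.city'])
If the column has already been written to and re-read from a CSV, the dicts will have become strings and need ast.literal_eval before either option.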

scraping Json with python 3

Here is the script:
from bs4 import BeautifulSoup as bs4
import requests
import json
from lxml import html
from pprint import pprint
import re

def get_data():
    url = 'https://sports.bovada.lv//baseball/mlb/game-lines-market-group'
    r = requests.get(url, headers={"User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_2) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/48.0.2564.103 Safari/537.36"})
    html_bytes = r.text
    soup = bs4(html_bytes, 'lxml')
    # res = soup.findAll('script')  # find all scripts..
    pattern = re.compile(r"swc_market_lists\s+=\s+(\{.*?\})")
    script = soup.find("script", text=pattern)
    return script.text[23:]

test1 = get_data()
data = json.loads(test1)
for item1 in data['items']:
    data1 = item1['itemList']['items']
    for item2 in data1:
        pitch_a = item2['opponentAName']
        pitch_b = item2['opponentBName']
##        group = item2['displayGroups']
##        for item3 in group:
##            new_il = item3['itemList']
##            for item4 in new_il:
##                market = item4['description']
##                oc = item4['outcomes']
        print(pitch_a, pitch_b)

##for items in data['items']:
##    pos = items['itemList']['items']
##    for item in pos:
##        work = item['competitors']
##        pitcher_a = item['opponentAName']
##        pitcher_b = item['opponentBName']
##        group = item['displayGroups']
##        for item, item2 in zip(work, group):
##            team = item['abbreviation']
##            place = item['type']
##            il2 = item2['itemList']
##            for item in il2:
##                ml = item['description']
##                print(team, place, pitcher_a, pitcher_b, ml)
I have been trying to scrape
team abbrev = ['items']['itemList']['items']['competitors']['abbreviation']
home_away = ['items']['itemList']['items']['competitors']['type']
team pitcher home = ['items']['itemList']['items']['opponentAName']
team pitcher away = ['items']['itemList']['items']['opponentBName']
moneyline american odds = ['items']['itemList']['items']['displayGroups']['itemList']['outcomes']['price']['american']
Total runs = ['items']['itemList']['items']['displayGroups']['itemList']['outcomes']['price']['handicap']
Part of the JSON, pprinted:
[{'baseLink': '/baseball/mlb/game-lines-market-group',
'defaultType': True,
'description': 'Game Lines',
'id': '136',
'itemList': {'items': [{'LIVE': True,
'atmosphereLink': '/api/atmosphere/eventNotification/events/A/3149961',
'awayTeamFirst': True,
'baseLink': '/baseball/mlb/minnesota-twins-los-angeles-angels-201805112207',
'competitionId': '24736',
'competitors': [{'abbreviation': 'LAA',
'description': 'Los Angeles Angels',
'id': '3149961-1642',
'rotationNumber': '978',
'shortName': 'Angels',
'type': 'HOME'},
{'abbreviation': 'MIN',
'description': 'Minnesota Twins',
'id': '3149961-9990',
'rotationNumber': '977',
'shortName': 'Twins',
'type': 'AWAY'}],
'denySameGame': 'NO',
'description': 'Minnesota Twins @ Los Angeles Angels',
'displayGroups': [{'baseLink': '/baseball/mlb/game-lines-market-group',
'defaultType': True,
'description': 'Game Lines',
'id': '136',
'itemList': [{'belongsToDefault': True,
'columns': 'H2Columns',
'description': 'Moneyline',
'displayGroups': '136,A-136',
'id': '46892277',
'isInRunning': True,
'mainMarketType': 'MONEYLINE',
'mainPeriod': True,
'marketTypeGroup': 'MONEY_LINE',
'notes': '',
'outcomes': [{'competitorId': '3149961-9990',
'description': 'Minnesota '
'Twins',
'id': '211933276',
'price': {'american': '-475',
'decimal': '1.210526',
'fractional': '4/19',
'id': '1033002124',
'outcomeId': '211933276'},
'status': 'OPEN',
'type': 'A'},
{'competitorId': '3149961-1642',
'description': 'Los '
'Angeles '
'Angels',
'id': '211933277',
'price': {'american': '+310',
'decimal': '4.100',
'fractional': '31/10',
'id': '1033005679',
'outcomeId': '211933277'},
'status': 'OPEN',
'type': 'H'}],
'periodType': 'Live '
'Match',
'sequence': '14',
'sportCode': 'BASE',
'status': 'OPEN',
'type': 'WW'},
{'belongsToDefault': True,
'columns': 'H2Columns',
'description': 'Runline',
'displayGroups': '136,A-136',
'id': '46892287',
'isInRunning': True,
'mainMarketType': 'SPREAD',
'mainPeriod': True,
'marketTypeGroup': 'SPREAD',
'notes': '',
'outcomes': [{'competitorId': '3149961-9990',
'description': 'Minnesota '
'Twins',
'id': '211933278',
'price': {'american': '+800',
'decimal': '9.00',
'fractional': '8/1',
'handicap': '-1.5',
'id': '1033005677',
'outcomeId': '211933278'},
'status': 'OPEN',
'type': 'A'},
{'competitorId': '3149961-1642',
'description': 'Los '
'Angeles '
'Angels',
'id': '211933279',
'price': {'american': '-2000',
'decimal': '1.050',
'fractional': '1/20',
'handicap': '1.5',
'id': '1033005678',
'outcomeId': '211933279'},
'status': 'OPEN',
'type': 'H'}],
'periodType': 'Live '
'Match',
'sequence': '14',
'sportCode': 'BASE',
'status': 'OPEN',
'type': 'SPR'}],
'link': '/baseball/mlb/game-lines-market-group'}],
'feedCode': '13625145',
'id': '3149961',
'link': '/baseball/mlb/minnesota-twins-los-angeles-angels-201805112207',
'notes': '',
'numMarkets': 2,
'opponentAId': '214704',
'opponentAName': 'Tyler Skaggs (L)',
'opponentBId': '215550',
'opponentBName': 'Lance Lynn (R)',
'sport': 'BASE',
'startTime': 1526090820000,
'status': 'O',
'type': 'MLB'},
There are a few different loops started in the script above, but none of them is working out the way I would like.
away team | away moneyline | away pitcher | Total Runs | and then the same for the home team is what I would like it to be eventually. I can write to CSV once it is parsed the proper way.
Thank you for the fresh set of eyes. I've been working on this for the better part of a day, trying to figure out the best way to access the content I want. If JSON is not the best way and bs4 works better, I would love to hear your opinion.
There's no simple answer to your problem. Scraping data requires you to carefully assess the data you are dealing with, work out where the parts you want to extract are located, and figure out how to effectively store the data you extract.
Try printing the data in your loops to visualise what is happening in your code (or try debugging). From there it's easy to figure out whether you're iterating over what you expect. Look for patterns throughout the input data to help organise the data you extract.
To help yourself, you should give your variables descriptive names, separate your code into logical chunks, and add comments when it starts to get complicated.
Here's some working code, but I encourage you to first try what I described above; then, if you're still stuck, look below for guidance.
output = {}
root = data['items'][0]
for game_line in root['itemList']['items']:
    # Create a temporary dict to store the data for this gameline
    team_data = {}
    # Get competitors
    competitors = game_line['competitors']
    for team in competitors:
        team_type = team['type']  # either HOME or AWAY
        # Create a new dict to store data for each team
        team_data[team_type] = {}
        team_data[team_type]['abbreviation'] = team['abbreviation']
        team_data[team_type]['name'] = team['description']
    # Get MoneyLine and Total Runs
    for item in game_line['displayGroups'][0]['itemList']:
        for outcome in item['outcomes']:
            team_type = outcome['type']  # either A or H
            team_type = 'HOME' if team_type == 'H' else 'AWAY'
            if item['mainMarketType'] == 'MONEYLINE':
                team_data[team_type]['moneyline'] = outcome['price']['american']
            elif item['mainMarketType'] == 'SPREAD':
                team_data[team_type]['total runs'] = outcome['price']['handicap']
    # Get the pitchers
    team_data['HOME']['pitcher'] = game_line['opponentAName']
    team_data['AWAY']['pitcher'] = game_line['opponentBName']
    # For each gameline, add the team data we gathered to the output dict
    output[game_line['description']] = team_data
This produces output like:
{
    'Atlanta Braves @ Miami Marlins': {
        'AWAY': {
            'abbreviation': 'ATL',
            'moneyline': '-130',
            'name': 'Atlanta Braves',
            'pitcher': 'Mike Soroka (R)',
            'total runs': '-1.5'
        },
        'HOME': {
            'abbreviation': 'MIA',
            'moneyline': '+110',
            'name': 'Miami Marlins',
            'pitcher': 'Jarlin Garcia (L)',
            'total runs': '1.5'
        }
    },
    'Boston Red Sox @ Toronto Blue Jays': {
        'AWAY': {
            'abbreviation': 'BOS',
            'moneyline': '-133',
            'name': 'Boston Red Sox',
            'pitcher': 'David Price (L)',
            'total runs': '-1.5'
        },
        'HOME': {
            'abbreviation': 'TOR',
            'moneyline': '+113',
            'name': 'Toronto Blue Jays',
            'pitcher': 'Marco Estrada (R)',
            'total runs': '1.5'
        }
    },
}
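Since you mentioned writing to CSV once it's parsed, here is a minimal sketch that flattens the output dict above into one row per team (the column names and file name are just assumptions):
import csv

with open('gamelines.csv', 'w', newline='') as f:
    writer = csv.writer(f)
    writer.writerow(['game', 'side', 'team', 'abbreviation', 'pitcher', 'moneyline', 'total_runs'])
    for game, sides in output.items():
        for side, info in sides.items():
            writer.writerow([game, side, info.get('name'), info.get('abbreviation'),
                             info.get('pitcher'), info.get('moneyline'), info.get('total runs')])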

Splitting multiple Dictionaries within a Pandas Column

I'm trying to split a list of dictionaries stored within a pandas column, but it isn't working for me...
The column looks like this when called:
df.topics[3]
Output
"[{'urlkey': 'webdesign', 'name': 'Web Design', 'id': 659}, {'urlkey': 'productdesign', 'name': 'Product Design', 'id': 2993}, {'urlkey': 'internetpro', 'name': 'Internet Professionals', 'id': 10102}, {'urlkey': 'web', 'name': 'Web Technology', 'id': 10209}, {'urlkey': 'software-product-management', 'name': 'Software Product Management', 'id': 42278}, {'urlkey': 'new-product-development-software-tech', 'name': 'New Product Development: Software & Tech', 'id': 62946}, {'urlkey': 'product-management', 'name': 'Product Management', 'id': 93740}, {'urlkey': 'internet-startups', 'name': 'Internet Startups', 'id': 128595}]"
I want to only be left with the 'name' and 'id' to put into separate columns of topic_1, topic_2, and so forth.
Appreciate any help.
You can give this a try.
import json

df.topics.apply(lambda x: {d['id']: d['name'] for d in json.loads(x.replace("'", '"'))})
The output for the row you gave is:
{659: 'Web Design',
2993: 'Product Design',
10102: 'Internet Professionals',
10209: 'Web Technology',
42278: 'Software Product Management',
62946: 'New Product Development: Software & Tech',
93740: 'Product Management',
128595: 'Internet Startups'}
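Note that the replace("'", '"') trick breaks as soon as a value contains an apostrophe; ast.literal_eval parses the Python-style string directly and is safer. A sketch that also spreads the topics into topic_1, topic_2, ... columns as requested:
import ast
import pandas as pd

# Parse the stringified list of dicts in each row
parsed = df['topics'].apply(ast.literal_eval)

# Keep only (name, id) pairs, then spread them across numbered columns;
# shorter rows are padded automatically
pairs = parsed.apply(lambda lst: [(d['name'], d['id']) for d in lst])
wide = pd.DataFrame(pairs.tolist(), index=df.index)
wide.columns = ['topic_%d' % (i + 1) for i in range(wide.shape[1])]
df = df.join(wide)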
You can also try a simple loop. Note that df.topics[3] is a string, so it needs to be parsed into a list first:
import ast

dt = ast.literal_eval(df.topics[3])  # parse the string into a list of dicts
li = []
for x in range(len(dt)):
    t = {dt[x]['id']: dt[x]['name']}
    li.append(t)
print(li)
The output is:
[{659: 'Web Design'},
{2993: 'Product Design'},
{10102: 'Internet Professionals'},
{10209: 'Web Technology'},
{42278: 'Software Product Management'},
{62946: 'New Product Development: Software & Tech'},
{93740: 'Product Management'},
{128595: 'Internet Startups'}]
First we take the value of df.topics[3] in dt, which is a list with dictionaries inside it. Then we create a temporary list li to which we append our values. Next we loop over the length of dt: in each iteration, t is built from the id and name, so dt[0]['id'] and dt[0]['name'] give 659: 'Web Design'. As x increases, each id/name pair is turned into a dictionary with { : } and appended to the temporary list li.

Python/Shell Script - Merging 2 rows of a CSV file where Address column has 'New Line' character

I have a CSV file which contains a couple of columns. For example:
FName,LName,Address1,City,Country,Phone,Email
Matt,Shew,"503, Avenue Park",Auckland,NZ,19809224478,matt@xxx.com
Patt,Smith,"503, Baker Street
Mickey Park
Suite 510",Austraila,AZ,19807824478,patt@xxx.com
Doug,Stew,"12, Main St.
21st Lane
Suit 290",Chicago,US,19809224478,doug@xxx.com
Henry,Mark,"88, Washington Park",NY,US,19809224478,matt@xxx.com
Opened in Excel, the multi-line addresses end up spread over several rows.
It's a common tendency to enter addresses in this manner; people sometimes copy their signature and paste it into the Address column, which creates this situation.
I have tried reading this using Python's csv module, and it looks like Python doesn't distinguish between a '\n' newline inside a field value and the end of a line.
My code:
import csv

with open(file_path, 'r') as f_obj:
    input_data = []
    reader = csv.DictReader(f_obj)
    for row in reader:
        print row
The output looks something like this:
{'City': 'Auckland', 'Address1': '503, Avenue Park', 'LName': 'Shew', 'Phone': '19809224478', 'FName': 'Matt', 'Country': 'NZ', 'Email': 'matt@xxx.com'}
{'City': 'Austraila', 'Address1': '503, Baker Street\nMickey Park\nSuite 510', 'LName': 'Smith', 'Phone': '19807824478', 'FName': 'Patt', 'Country': 'AZ', 'Email': 'patt@xxx.com'}
{'City': 'Chicago', 'Address1': '12, Main St. \n21st Lane \nSuit 290', 'LName': 'Stew', 'Phone': '19809224478', 'FName': 'Doug', 'Country': 'US', 'Email': 'doug@xxx.com'}
{'City': 'NY', 'Address1': '88, Washington Park', 'LName': 'Mark', 'Phone': '19809224478', 'FName': 'Henry', 'Country': 'US', 'Email': 'matt@xxx.com'}
I just want to write the same content to a file, where none of the values for the Address1 key contain a '\n' character, so it looks like:
{'City': 'Auckland', 'Address1': '503, Avenue Park', 'LName': 'Shew', 'Phone': '19809224478', 'FName': 'Matt', 'Country': 'NZ', 'Email': 'matt@xxx.com'}
{'City': 'Austraila', 'Address1': '503, Baker Street Mickey Park Suite 510', 'LName': 'Smith', 'Phone': '19807824478', 'FName': 'Patt', 'Country': 'AZ', 'Email': 'patt@xxx.com'}
{'City': 'Chicago', 'Address1': '12, Main St. 21st Lane Suit 290', 'LName': 'Stew', 'Phone': '19809224478', 'FName': 'Doug', 'Country': 'US', 'Email': 'doug@xxx.com'}
{'City': 'NY', 'Address1': '88, Washington Park', 'LName': 'Mark', 'Phone': '19809224478', 'FName': 'Henry', 'Country': 'US', 'Email': 'matt@xxx.com'}
Any suggestions, guys?
PS:
I have more than 100K such records in my CSV file!
You can replace the print row line with a dict comprehension that strips newlines from the values:
row = {k: v.replace('\n', ' ') for k, v in row.iteritems()}
print row
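If you want to write the cleaned rows back out as a CSV rather than printing dicts, a DictWriter round-trip is enough. A Python 3 sketch (the file names are assumptions):
import csv

with open('input.csv', newline='') as src, open('cleaned.csv', 'w', newline='') as dst:
    reader = csv.DictReader(src)
    writer = csv.DictWriter(dst, fieldnames=reader.fieldnames)
    writer.writeheader()
    for row in reader:
        # collapse embedded newlines in every field, not just Address1
        writer.writerow({k: v.replace('\n', ' ') for k, v in row.items()})
Since csv streams row by row, 100K+ records are not a problem.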
