Python HTML Scrape Not Yielding a Result - python

(New to Python and 1st post)
See code below, but here's the issue:
I'm trying to scrape the webpage in the code for all the job titles on the page, but when I print the list, I'm not getting any values. I've tried different XPaths to see if I could get anything to print, but the list always comes back empty.
Does anyone know if it is an issue with my code, or if there is something about the site structure that I didn't consider?
Thanks in advance!
from lxml import html
import requests
page = requests.get("https://careers.homedepot.com/job-search-results/?location=Atlanta%2C%20GA%2C%20United%20States&latitude=33.7489954&longitude=-84.3879824&radius=15&parent_category=Corporate%2FOther")
tree = html.fromstring(page.content)
Job_Title = tree.xpath('//*[@id="widget-jobsearch-results-list"]/div/div/div/div[@class="jobTitle"]/a/text()')
print (Job_Title)

The information you're looking for is generated dynamically with JavaScript, while requests only gives you the initial HTML page source.
You might need to use selenium (+ chromedriver) to get the required data:
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait as wait
from selenium.webdriver.support import expected_conditions as EC
driver = webdriver.Chrome()
driver.get("https://careers.homedepot.com/job-search-results/?location=Atlanta%2C%20GA%2C%20United%20States&latitude=33.7489954&longitude=-84.3879824&radius=15&parent_category=Corporate%2FOther")
xpath = "//a[starts-with(#id, 'job-results')]"
wait(driver, 10).until(EC.element_to_be_clickable((By.XPATH, xpath)))
jobs = [job.text for job in driver.find_elements_by_xpath(xpath)]
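Side note: find_elements_by_xpath comes from older Selenium releases; on Selenium 4 the equivalent lookup uses the By strategy directly. A sketch of only the changed call, everything else as above:
# Selenium 4 style: the find_elements_by_* helpers were removed in favour of By locators
jobs = [job.text for job in driver.find_elements(By.XPATH, xpath)]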

Try a library that can execute JS (dryscrape is a lightweight alternative).
Here's a code sample:
from lxml import html
import dryscrape

session = dryscrape.Session()
session.visit("https://careers.homedepot.com/job-search-results/?location=Atlanta%2C%20GA%2C%20United%20States&latitude=33.7489954&longitude=-84.3879824&radius=15&parent_category=Corporate%2FOther")
# session.body() returns the rendered HTML as a string
page = session.body()
tree = html.fromstring(page)
Job_Title = tree.xpath('//*[@id="widget-jobsearch-results-list"]/div/div/div/div[@class="jobTitle"]/a/text()')
print(Job_Title)

That page builds its HTML (the results table) with JS. In other words, the target block does not exist in the initial HTML of that page. Open the page source and check it:
<div class="entry-content-wrapper clearfix">
<div id="widget-jobsearch-results-list"></div> # <- Target block is empty!
<div id="widget-jobsearch-results-pages"></div>
</div>
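You can confirm this from Python as well; a minimal sketch (assuming the same URL as in the question) that just counts the children of the target div:
from lxml import html
import requests

page = requests.get("https://careers.homedepot.com/job-search-results/?location=Atlanta%2C%20GA%2C%20United%20States&latitude=33.7489954&longitude=-84.3879824&radius=15&parent_category=Corporate%2FOther")
tree = html.fromstring(page.content)

# the container div is in the raw HTML, but it has no child elements,
# because the job list is injected later by JavaScript
containers = tree.xpath('//*[@id="widget-jobsearch-results-list"]')
print(len(containers))                   # expect 1: the div itself is there
print([len(div) for div in containers])  # expect [0]: but it is empty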

Related

Why am I getting inconsistent results from web scraping?

I'm having issues scraping data from a website. The issue might be with Visual Studio Code; I am using the "Code Runner" extension. This is my first time using Beautiful Soup and Selenium, so the issue might also be with my code. I started last Friday and, after some difficulty, came up with a solution on Saturday. My code is:
import requests
from bs4 import BeautifulSoup, SoupStrainer
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
parcelID = 1014100000 #this is a random parcelID I grabbed from the site
url = 'https://www.manateepao.com/parcel/?parid={}'.format(parcelID)
driver = webdriver.Chrome()
driver.get(url)
html = driver.execute_script("return document.documentElement.outerHTML")
#was getting encoding error with print(html). replaced character that was giving me trouble
newHTML = html.replace(u"\u2715", "*")
soupFilter = SoupStrainer('div', {'id': 'ownerContentScollContainer'})
soup = BeautifulSoup(newHTML, 'html.parser', parse_only=soupFilter)
webparcelID = soup.find_all('b')
lColumn = soup.find_all('div', {'class' : 'col-sm-2 m-0 p-0 text-sm-right'})
rColumn = soup.find_all('div', {'class' : 'col-sm m-0 p-0 ml-2'})
parcel_Dict = {}
for i in range(len(lColumn)):
    parcel_Dict[i] = {lColumn[i].string: rColumn[i].string}
#This is to test if I got any results or not
print(parcel_Dict)
driver.close()
driver.quit()
What I am hoping to find each time I scrape a page is:
The Parcel ID. This is in its own bold, b, tag.
The Ownership and Mailing Address. The Ownership should always be at parcel_Dict[1] and the mailing address should always be at parcel_Dict[3].
I run the code and sometimes I get a result, and other times I get an empty dictionary.
Thank you for any help you can provide.
I solved my own issue by adding the following lines of code:
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
wait = WebDriverWait(driver, 10)
wait.until(EC.presence_of_element_located((By.XPATH, "//div[@id='ownerContentScollContainer']")))
I waited until the ownerContentScrollContainer was fully loaded before proceeding to execute the rest of the code.
This post and this post helped me figure out where I might be going wrong. I used this tutorial to figure out how to use the appropriate Xpath.
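For anyone wiring the fix into the original script: the wait has to happen after driver.get(url) and before the page source is pulled. A minimal sketch of that ordering (same locator as above; the other lines come from the question's script):
driver = webdriver.Chrome()
driver.get(url)

# block until the container to be parsed is actually present in the DOM
wait = WebDriverWait(driver, 10)
wait.until(EC.presence_of_element_located((By.XPATH, "//div[@id='ownerContentScollContainer']")))

# only now grab the rendered HTML and hand it to BeautifulSoup
html = driver.execute_script("return document.documentElement.outerHTML")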

Can't scrape <ui-tags> in python. Unsure why?

I am trying to scrape AFL odds from betfair (https://www.betfair.com.au/exchange/plus/australian-rules).
I am fairly new to web scraping; I have managed to scrape odds from other bookies, but I am having trouble with Betfair. The data I need is within a "ui-view" tag, which doesn't seem to be populated when I use Beautiful Soup to get the HTML.
I've tried unsuccessfully to use selenium when loading the page to get the odds.
from selenium import webdriver
from bs4 import BeautifulSoup
import pprint as pp
BETFAIR_URL = "https://www.betfair.com.au/exchange/plus/australian-rules"
#functions
def parse(url):
    # open url
    driver = webdriver.Chrome(
        'C:/Users/Maroz/Downloads/chromedriver_win32 (1)/chromedriver.exe')
    # opens page
    driver.get(url)
    # parses as html
    soup = BeautifulSoup(driver.page_source, 'html.parser')
    # closes driver
    driver.quit()
    return soup
betfair_soup = parse(BETFAIR_URL)
pp.pprint(betfair_soup)
# edit to show that it finds nothing in the span I need, which is within the ui-view tags
price = betfair_soup.find_all("span", {"class": "bet-button-price"})
pp.pprint(price)
#output is []
I expected betfair_soup to contain the information within the ui-view tag; however, it remains empty when printed to the terminal.
It won't let me post an image because this is my first post, but you might be able to see a screenshot of the tags I am trying to access here: https://imgur.com/gallery/jI3MQYY
As requested here is the html I get in terminal:
<!--[if IE]>
<script type="text/javascript">window['isIE'] = true;</script>
<![endif]-->
<!-- Set ie10 class: http://www.impressivewebs.com/ie10-css-hacks/ -->
<!--[if !IE]><!-->
<script>
(function () {
    var isIE10 = Function('/*@cc_on return document.documentMode===10@*/')();
    if (isIE10) {
        document.documentElement.className += ' ie10';
    }
})();
</script>
<!--<![endif]-->
<bf-meta-tags></bf-meta-tags>
<bf-tooltip-guide><div class="tooltip-guide-container" ng-controller="TooltipGuideController as controller"><!-- --></div></bf-tooltip-guide>
<!-- --><ui-view></ui-view> <!-- INFO IS IN HERE -->
<script src="//ie2eds.cdnppb.net/resources/eds/bundle/vendor-assets-min_4146.js"></script>
<script src="//ie2eds.cdnppb.net/resources/eds/bundle/bf-eds-static-client.min_4146_.js"></script>
<script type="text/javascript">
I put a comment where the odds are located. When I view the page source the tags are also closed, so there isn't any way for me to show you what I see when I inspect the odds box, other than the photo link I posted above.
edit: After trying the suggestion to wait for ui-view to load, this is the entire response; I still couldn't access the information in the span tags, though.
https://pastebin.com/v6JzYa6V
FINAL EDIT: Problem solved! Thank you to everyone for your suggestions and special thanks to S Ahmed for his persistence in solving this for me!
It looks like the content of the <ui-view> tag takes time to load, and it is loaded by JavaScript. Try waiting for an internal element to be present before getting the source of the page.
Try this:
def parse(url):
    driver.get(url)
    try:
        WebDriverWait(driver, 30).until(EC.presence_of_element_located((By.ID, "main-wrapper")))
    except:
        pp.pprint("Exception")
    finally:
        soup = BeautifulSoup(driver.page_source, 'html.parser')
        driver.quit()
        return soup
You have to import the following libraries
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
Edit:
Try waiting for the span.bet-button-price element to be present instead of #main-wrapper:
WebDriverWait(driver, 30).until(EC.presence_of_element_located((By.CSS_SELECTOR,"span.bet-button-price")))
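Once that wait succeeds, the original find_all should pick up the prices. A short usage sketch (assuming the class name on the rendered page is unchanged):
betfair_soup = parse(BETFAIR_URL)

# with the JS content rendered, the price spans should now be in the soup
prices = [span.get_text(strip=True)
          for span in betfair_soup.find_all("span", {"class": "bet-button-price"})]
pp.pprint(prices)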

Expanding DOM lists to extract additional content with Python, Selenium, Beautiful Soup

I am looking to extract content from a page that requires a list node to be selected. I have retrieved the page HTML using Python and Selenium. Passing the page source to BS4, I can parse out the content that I am looking for using
open_li = soup.select('div#tree ul.jstree-container-ul li')
Each list item returned has an
aria-expanded = "false" and class="jstree-node jstree-closed"
Looking at Inspect Element, the content is loaded when these attributes are set to
aria-expanded = "true" and class="jstree-node jstree-open"
I have tried using the .click() method on the content:
driver.find_element_by_id('tree').click()
But that only changes other content on the page. I think the list nodes themselves have to be expanded when making the request.
Does someone know how to change the aria-expanded attributes on a page before returning the content?
Thanks
You can use the requests package to get all the information as JSON.
Here is an example of how you can get all the information for the page:
import requests
if __name__ == '__main__':
    url = "https://app.updateimpact.com/api/singleArtifact?artifactId=commons-lang3&groupId=org.apache.commons&version=3.7"
    req_params = requests.get(url).json()
    response = requests.get(
        'https://app.updateimpact.com/api/builds/%s/%s' % (req_params["userIdStr"], req_params["buildId"]))
    print(response.json())
There could be multiple reasons for not getting the output
a) You are clicking on the wrong element
b) You are not waiting for the element to be loaded before clicking on it
c) You are not waiting for the content to be loaded after clicking on the element
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
driver = webdriver.Chrome('/path/to/chromedriver')
url="https://app.updateimpact.com/treeof/org.apache.commons/commons-lang3/3.7"
driver.get(url)
element = WebDriverWait(driver, 10).until(EC.presence_of_element_located((By.XPATH, '//*[@id="org.apache.commons:commons-lang3:3.7:jar_anchor"]/span')))
element.click()
element = WebDriverWait(driver, 10).until(EC.presence_of_element_located((By.XPATH, '//*[@id="tree-detail"]/div[2]/span[1]')))
print(driver.find_element_by_xpath('//*[@id="detail_div"]').text)
Output
org.apache.commons:commons-lang3:3.7:jar (back)
Project module (browse only dependencies of this module)
Group id org.apache.commons
Artifact id commons-lang3
Version 3.7
Type jar
This dependency isn't a dependency of any other dependencies.

LXML XPATH - Data returned from one site and not another

I'm just learning Python and decided to play with some website scraping.
I created one scraper that works, and a second, almost identical as far as I can tell, that doesn't, and I can't figure out why.
from lxml import html
import requests
page = requests.get('https://thronesdb.com/set/Core')
tree = html.fromstring(page.content)
cards = [tree.xpath('//a[@class = "card-tip"]/text()'), tree.xpath('//td[@data-th = "Faction"]/text()'),
         tree.xpath('//td[@data-th = "Cost"]/text()'), tree.xpath('//td[@data-th = "Type"]/text()'),
         tree.xpath('//td[@data-th = "STR"]/text()'), tree.xpath('//td[@data-th = "Traits"]/text()'),
         tree.xpath('//td[@data-th = "Set"]/text()'), tree.xpath('//a[@class = "card-tip"]/@data-code')]
print(cards)
That one does what I expect (I know it's not pretty). It grabs certain elements from a table on the site.
This one returns [[]]:
from lxml import html
import requests
page = requests.get('http://www.redflagdeals.com/search/#!/q=baby%20monitor')
tree = html.fromstring(page.content)
offers = [tree.xpath('//a[@class = "offer_title"]/text()')]
print(offers)
What I expect it to do is give me a list that has the text from each offer_title element on the page.
The XPath I'm aiming at, which I grabbed from Firebug, is:
/html/body/div[1]/div/div/div/section/div[2]/ul[1]/li[2]/div/h3/a
Here's the actual string from the site:
Angelcare Digital Video And Sound Monitor - $89.99 ($90.00 Off)
I have also read a few other questions, but they didn't explain why this would work the first way but not the second. I can't post links to them because of the link restrictions on new accounts.
Titles:
Python - Unable to Retrieve Data From Webpage Table Using Beautiful Soup or lxml xpath
Python lxml xpath no output
Trouble with scraping text from site using lxml / xpath()
Any help would be appreciated. I did some reading on the lxml website about xpath, but I may be missing something in the way I'm building a query.
Thanks!
The reason the first snippet works is that the required data is present in the initial DOM, while on the second page the data is generated dynamically by JavaScript, so you cannot scrape it with requests alone; requests doesn't handle dynamic content.
You can try, for example, Selenium + PhantomJS to get the required data, as below:
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait as wait
driver = webdriver.PhantomJS(executable_path='/path/to/phantomJS')
driver.get('http://www.redflagdeals.com/search/#!/q=baby%20monitor')
xpath = '//a[@class = "offer_title"]'
wait(driver, 10).until(EC.element_to_be_clickable((By.XPATH, xpath)))
offers = [link.get_attribute('textContent') for link in driver.find_elements_by_xpath(xpath)]
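PhantomJS has since been deprecated in Selenium, so if that is a concern, headless Chrome is a reasonable substitute here; a sketch, assuming chromedriver is available on your PATH:
from selenium import webdriver

options = webdriver.ChromeOptions()
options.add_argument('--headless')          # render the page without opening a window
driver = webdriver.Chrome(options=options)  # chromedriver must be on PATH
The rest of the snippet (the wait and the XPath lookup) stays the same.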

Urllib in Python is not providing the HTML code I see with Inspect Element

I'm trying to crawl the results in this link:
url = "http://topsy.com/trackback?url=http%3A%2F%2Fmashable.com%2F2014%2F08%2F27%2Faustralia-retail-evolution-lab-aopen-shopping%2F"
When I inspect it with Firebug I can see the HTML code and I know what I need to do to extract the tweets. The problem is that when I get the response using urlopen, I don't get the same HTML code; I only get the tags. What am I missing?
Example code below:
# imports assumed (the original snippet omitted them); urllib2 matches the Python 2 answers below
from urllib2 import urlopen
from bs4 import BeautifulSoup

def get_tweets(section_url):
    html = urlopen(section_url).read()
    soup = BeautifulSoup(html, "lxml")
    tweets = soup.find("div", "results")
    category_links = [tweet.a["href"] for tweet in tweets.findAll("div", "result-tweet")]
    return category_links
url = "http://topsy.com/trackback?url=http%3A%2F%2Fmashable.com%2F2014%2F08%2F27%2Faustralia-retail-evolution-lab-aopen-shopping%2F"
cat_links = get_tweets(url)
Thanks,
YB
The problem is that the content of the results div is filled in by an extra HTTP call and JavaScript code executed on the browser side. urllib only "sees" the initial HTML page, which doesn't contain the data you need.
One option would be to follow @Himal's suggestion and simulate the underlying request to trackbacks.js that is sent for the tweet data. The result is in JSON format, which you can load() using the json module from the standard library:
import json
import urllib2
url = 'http://otter.topsy.com/trackbacks.js?url=http%3A%2F%2Fmashable.com%2F2014%2F08%2F27%2Faustralia-retail-evolution-lab-aopen-shopping%2F&infonly=0&call_timestamp=1411090809443&apikey=09C43A9B270A470B8EB8F2946A9369F3'
data = json.load(urllib2.urlopen(url))
for tweet in data['response']['list']:
    print tweet['permalink_url']
Prints:
http://twitter.com/Evonomie/status/512179917610835968
http://twitter.com/abs_office/status/512054653723619329
http://twitter.com/TKE_Global/status/511523709677756416
http://twitter.com/trevinocreativo/status/510216232122200064
http://twitter.com/TomCrouser/status/509730668814028800
http://twitter.com/Evonomie/status/509703168062922753
http://twitter.com/peterchaly/status/509592878491136000
http://twitter.com/chandagarwala/status/509540405411840000
http://twitter.com/Ayjay4650/status/509517948747526144
http://twitter.com/Marketingccc/status/509131671900536832
This was "going down to metal" option.
Otherwise, you can take a "high-level" approach and don't bother about what is there happening under-the-hood. Let the real browser load the page which you would interact with through selenium WebDriver:
from selenium import webdriver
driver = webdriver.Chrome() # can be Firefox(), PhantomJS() and more
driver.get("http://topsy.com/trackback?url=http%3A%2F%2Fmashable.com%2F2014%2F08%2F27%2Faustralia-retail-evolution-lab-aopen-shopping%2F")
for tweet in driver.find_elements_by_class_name('result-tweet'):
    print tweet.find_element_by_xpath('.//div[@class="media-body"]//ul[@class="inline"]/li//a').get_attribute('href')
driver.close()
Prints:
http://twitter.com/Evonomie/status/512179917610835968
http://twitter.com/abs_office/status/512054653723619329
http://twitter.com/TKE_Global/status/511523709677756416
http://twitter.com/trevinocreativo/status/510216232122200064
http://twitter.com/TomCrouser/status/509730668814028800
http://twitter.com/Evonomie/status/509703168062922753
http://twitter.com/peterchaly/status/509592878491136000
http://twitter.com/chandagarwala/status/509540405411840000
http://twitter.com/Ayjay4650/status/509517948747526144
http://twitter.com/Marketingccc/status/509131671900536832
This is how you can scale the second option to get all of the tweets by following the pagination:
from selenium import webdriver
from selenium.common.exceptions import NoSuchElementException
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
BASE_URL = 'http://topsy.com/trackback?url=http%3A%2F%2Fmashable.com%2F2014%2F08%2F27%2Faustralia-retail-evolution-lab-aopen-shopping%2F&offset={offset}'
driver = webdriver.Chrome()
# get tweets count
driver.get('http://topsy.com/trackback?url=http%3A%2F%2Fmashable.com%2F2014%2F08%2F27%2Faustralia-retail-evolution-lab-aopen-shopping%2F')
tweets_count = int(driver.find_element_by_xpath('//li[@data-name="all"]/a/span').text)

for x in xrange(0, tweets_count, 10):
    driver.get(BASE_URL.format(offset=x))

    # page header appears in case no more tweets are found
    try:
        driver.find_element_by_xpath('//div[@class="page-header"]/h3')
    except NoSuchElementException:
        pass
    else:
        break

    # wait for results
    WebDriverWait(driver, 5).until(
        EC.presence_of_element_located((By.ID, "results"))
    )

    # get tweets
    for tweet in driver.find_elements_by_class_name('result-tweet'):
        print tweet.find_element_by_xpath('.//div[@class="media-body"]//ul[@class="inline"]/li//a').get_attribute('href')
driver.close()
