I am trying to extract data from a search box; you can see a good example on Wikipedia.
This is my code:
driver = webdriver.Firefox()
driver.get(response.url)
city = driver.find_element_by_id('searchInput')
city.click()
city.clear()
city.send_keys('a')
time.sleep(1.5) #waiting for ajax to load
selen_html = driver.page_source
#print selen_html.encode('utf-8')
hxs = HtmlXPathSelector(text=selen_html)
ajaxWikiList = hxs.select('//div[@class="suggestions"]')
items = []
for city in ajaxWikiList:
    item = TestItem()
    item['ajax'] = city.select('/div[@class="suggestions-results"]/a/@title').extract()
    items.append(item)
print items
The XPath expression is OK; I checked it on a static page. If I uncomment the line that prints out the scraped HTML, the code for the box shows up at the end of the file. But for some reason I can't extract data from it with the above code. I must be missing something, since I tried two different sources; the Wikipedia page is just another source where I can't get this data extracted.
Any advice here? Thanks!
Instead of passing the .page_source, which in your case contains an empty suggestions div, get the innerHTML of the element and pass it to the Selector:
selen_html = driver.find_element_by_class_name('suggestions').get_attribute('innerHTML')
hxs = HtmlXPathSelector(text=selen_html)
suggestions = hxs.select('//div[@class="suggestions-results"]/a/@title').extract()
for suggestion in suggestions:
    print suggestion
Outputs:
Animal
Association football
Arthropod
Australia
AllMusic
African American (U.S. Census)
Album
Angiosperms
Actor
American football
Note that it would be better to use Selenium's waits feature to wait for the element to be accessible/visible; a short sketch follows the links below. See:
How can I get Selenium Web Driver to wait for an element to be accessible, not just present?
Selenium waitForElement
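For example, here is a minimal sketch of an explicit wait on the suggestions box (assuming the driver from the question is still in scope):
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

# wait up to 10 seconds for the suggestions box to become visible,
# instead of a fixed time.sleep(1.5)
suggestions_div = WebDriverWait(driver, 10).until(
    EC.visibility_of_element_located((By.CLASS_NAME, "suggestions"))
)
selen_html = suggestions_div.get_attribute('innerHTML')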
Also, note that HtmlXPathSelector is deprecated, use Selector instead.
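For example, a minimal sketch of the same extraction through the non-deprecated Selector (parsel's Selector works the same way):
from scrapy.selector import Selector

sel = Selector(text=selen_html)
# same XPath as before, just through the current API
for title in sel.xpath('//div[@class="suggestions-results"]/a/@title').extract():
    print(title)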
Here are the two tags I am trying to scrape: https://i.stack.imgur.com/a1sVN.png. In case you are wondering, this is the link to that page (the tags I am trying to scrape are not behind the paywall): https://www.wsj.com/articles/chinese-health-official-raises-covid-alarm-ahead-of-lunar-new-year-holiday-11672664635
Below is the Python code I am using. Does anyone know why the tags are not being stored properly in paragraphs?
from selenium import webdriver
from selenium.webdriver.common.by import By
url = 'https://www.wsj.com/articles/chinese-health-official-raises-covid-alarm-ahead-of-lunar-new-year-holiday-11672664635'
driver = webdriver.Chrome()
driver.get(url)
paragraphs = driver.find_elements(By.CLASS_NAME, 'css-xbvutc-Paragraph e3t0jlg0')
print(len(paragraphs)) # => prints 0
So you have two problems impacting you:
You should wait for the page to load after you get() the webpage. You can do this with something like import time and time.sleep(10).
The elements that you are trying to scrape have class names that change on every page load. However, the data-type='paragraph' attribute stays constant, so you are able to do:
paragraphs = driver.find_elements(By.XPATH, '//*[@data-type="paragraph"]')  # search by XPath to find the elements with that data attribute
print(len(paragraphs))
This prints 2 after the page is loaded.
Just to add on to @Andrew Ryan's answer, you can use an explicit wait for a shorter and more dynamic waiting time:
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

paragraphs = WebDriverWait(driver, 10).until(
    EC.presence_of_all_elements_located((By.XPATH, '//*[@data-type="paragraph"]'))
)
print(len(paragraphs))
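And to read the article text once the elements are present (a minimal follow-up sketch):
for p in paragraphs:
    print(p.text)  # each item is a Selenium WebElement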
So here is my code:
# -*- coding: utf-8 -*-
import scrapy
from ..items import LowesspiderItem
from scrapy.http import Request
import requests
import pandas as pd
class LowesSpider(scrapy.Spider):
    name = 'lowes'

    def start_requests(self):
        start_urls = ['https://www.lowes.com/pd/ZLINE-KITCHEN-BATH-Alpine-Brushed-Nickel-2-Handle-Widespread-Bathroom-Sink-Faucet-with-Drain/1002623090']
        for url in start_urls:
            yield Request(url,
                          headers={'Cookie': 'sn=2333;'},    # Preset a location
                          meta={'dont_merge_cookies': True,  # Allows location cookie to get through
                                'url': url})                 # Using to get the product SKU

    def parse(self, response):
        item = LowesspiderItem()
        # get product price
        productPrice = response.css('.sc-kjoXOD iNCICL::text').get()
        item["productPrice"] = productPrice
        yield item
So this was working last week, but then my spider broke; I assume the website was modified, so all my selectors stopped matching. I'm trying to find the new selector for the price but I'm not having any luck.
First I checked whether this data was being created dynamically (it's not), so I think using normal Scrapy should be fine; correct me if I'm wrong. Here's a screenshot of the page with JavaScript disabled.
So then, I inspected page source and just CTRL + F the price to find the selector that I'd want/need.
And here's that snippet as text (if that helps):
left"><svg data-test="arrow-left" color="interactive" viewBox="0 0 24 24" class="sc-jhAzac boeLhr"><path d="M16.88 5.88L15 4l-8 8 8 8 1.88-1.88L10.773 12z"></path></svg></button><button class="arrowNav right"><svg data-test="arrow-right" color="interactive" viewBox="0 0 24 24" class="sc-jhAzac boeLhr"><path d="M8.88 4L7 5.88 13.107 12 7 18.12 8.88 20l8-8z"></path></svg></button></div></div></div></div></div><div class="sc-iQKALj jeIzsl"><div class="sc-gwVKww kbEspX"><div class="sc-esOvli jhvGZy"><div tabindex="0" class="styles__PriceWrapper-sc-1ezid1y-0 cgqauT"><span class="finalPrice"><div class="sc-kjoXOD iNCICL">$314.96 </div><span class="aPrice large" aria-hidden="true"><sup itemProp="PriceCurrency" content="USD" aria-hidden="true">$</sup><span aria-hidden="true">314</span><sup aria-hidden="true">.<!-- --
and here is the link for the page source:
view-source:https://www.lowes.com/pd/ZLINE-KITCHEN-BATH-Alpine-Brushed-Nickel-2-Handle-Widespread-Bathroom-Sink-Faucet-with-Drain/1002623090
Edit:
Looking into the website I thought this selector would make more sense:
productPrice = response.css('.primary-font jumbo strong art-pd-contractPricing::text').get()
because:
the price is nested under this selector, but I still get none. I originally thought it was because this is a 'sale' price, so I checked if it was somehow generated through JavaScript, which it is not.
EDIT: So if anyone ever decides to scrape this website, the prices on their products will differ based on location. The cookie I had set was not for the location of my local store.
As the most cursory use of scrapy shell would have shown you, response.css('.sc-kjoXOD iNCICL') is not the correct CSS selector for your case, since a space means descendant.
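For example, the div in your page-source snippet carries both classes, so chaining them without a space would match it (a hedged sketch):
# one element with both classes: chain the class selectors without a space
price_text = response.css('div.sc-kjoXOD.iNCICL::text').get()
# '.sc-kjoXOD iNCICL', by contrast, looks for an <iNCICL> descendant of .sc-kjoXOD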
Based on your updated comment about the sale-priced pages differing from the other pages, one needs to use a more generic selector. Thankfully, Lowe's seems to honor the https://schema.org/Offer standard, which defines a price itemprop, meaning you have better-than-average confidence the markup won't change from sale page to non-sale page:
for offer in response.css('[itemtype="http://schema.org/Offer"]'):
    offered_price = offer.css('[itemprop="price"][content]').xpath('@content').get()
The asterisk to that comment is that the schema.org standard allows encoding the itemprop information in quite a few ways, and their use of the content="" attribute is only the current way, so watch out for that changing.
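A hedged sketch of guarding against that, falling back to the element text if the content attribute ever disappears:
for offer in response.css('[itemtype="http://schema.org/Offer"]'):
    offered_price = offer.css('[itemprop="price"][content]').xpath('@content').get()
    if offered_price is None:
        # fallback: the schema.org price may be carried as visible text instead
        offered_price = offer.css('[itemprop="price"]::text').get()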
So then, I inspected page source and just CTRL + F the price to find the selector that I'd want/need.
This is not the original source - this is HTML that has already been changed as a result of JavaScript.
You need to review the original HTML, because Scrapy works with the raw HTML.
You can access it by pressing CTRL + U or right mouse button -> View page source (not Inspect).
As a result you can see that there is a significant difference between the original HTML and the HTML changed by JavaScript.
In the original code the price occurs in several places:
Inside a script tag - some options.
Inside an input tag:
price = response.css("input[data-productprice]::attr(data-productprice)").extract_first()
Inside a span tag:
price = response.css('span[itemprop="price"]::attr(content)').extract_first()
UPD:
The full-price and sale-price selectors will be slightly different:
saleprice = response.css('span[itemprop="price"]::attr(content)').extract_first()
wasprice_text = response.css('span.art-pd-wasPriceLbl::text').extract_first()
if "$" in wasprice_text:
    fullprice = wasprice_text.split("$")[-1]
I'm trying to scrape matches and their respective odds from a local bookie site, but with every site I try my web scraper doesn't return anything; it just prints "Process finished with exit code 0".
Can someone help me crack open the containers and get out the contents?
I have tried all the sites below for almost a month but with no success. The problem seems to be with the exact div, class, or span element layout.
https://www.betlion.co.ug/
https://www.betpawa.ug/
https://www.premierbet.ug/
For example, I tried link 2 in the code as shown:
import requests
from bs4 import BeautifulSoup
url = "https://www.betpawa.ug/"
response = requests.get (url, timeout=5)
content = BeautifulSoup (response.content, "html.parser")
for match in content.findAll("div", attrs={"class": "events-container prematch", "id": "Bp-Event-591531"}):
    print(match.text.strip())
I expect the program to return a list of matches, odds, and all the other components of the container. However, the program runs and just prints "Process finished with exit code 0", nothing else.
It looks like the base site gets loaded in two phases:
Load some HTML structure for the page,
Use JavaScript to fill in the contents
You can prove this to yourself by right-clicking on the page, doing "view page source" and then searching for "events-container" (it is not there).
So you'll need something more powerful than requests + bs4. I have heard of folks using Selenium to do this, but I'm not familiar with it.
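If you'd rather confirm it from Python than from view-source, a quick hedged check is to fetch the raw HTML with requests and look for the marker class:
import requests

raw = requests.get("https://www.betpawa.ug/", timeout=5).text
# the events-container markup is added later by JavaScript,
# so this should print False
print("events-container" in raw)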
You should consider using urllib.request instead of requests:
from urllib.request import Request, urlopen
- build your request:
req = Request(url, headers={'User-Agent': 'Mozilla/5.0'})
- retrieve the document:
res = urlopen(req)
- parse it using bs4:
html = BeautifulSoup(res, 'html.parser')
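Put together, the steps above look roughly like this (a sketch; note it still fetches only the raw HTML, so JavaScript-rendered content will not appear):
from urllib.request import Request, urlopen
from bs4 import BeautifulSoup

url = "https://www.betpawa.ug/"
req = Request(url, headers={'User-Agent': 'Mozilla/5.0'})
res = urlopen(req)
html = BeautifulSoup(res, 'html.parser')
print(html.title)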
Like Chris Curvey described, the problem is that requests can't execute the JavaScript of the page. If you print your content variable you can see that the page would display a message like: "JavaScript Required! To provide you with the best possible product, our website requires JavaScript to function..." With Selenium you control a full browser in the form of a WebDriver (for example the ChromeDriver binary for the Google Chrome browser):
from bs4 import BeautifulSoup
from selenium import webdriver
chrome_options = webdriver.ChromeOptions()
# chrome_options.add_argument('headless')
driver = webdriver.Chrome(chrome_options = chrome_options)
url = "https://www.betpawa.ug/"
driver.get(url)
page = driver.page_source
content = BeautifulSoup(page, 'html.parser')
for match in content.findAll("div", attrs={"class": "events-container"}):
    print(match.text.strip())
Update:
In line 13 the command print(match.text.strip()) simply extracts the text of each match div which has the class attribute "events-container".
If you want to extract more specific content you can access each match via the match variable.
You need to know:
which of the available information you want,
how to identify this information inside the match div's structure,
and in which data type you need this information.
To make it easy, run the program and open Chrome's developer tools with F12; in the top left corner you will see the icon for "select an element ...".
If you click on that icon and then click on the desired element in the browser, you see the equivalent source in the area under the icon.
Analyse it carefully to get the info you need, for example:
The title of the football match is the first h3 tag in the match div and is a string.
The odds shown are span tags with the class event-odds and a number (float/double).
Search for the function you need in Google or in the reference of the package you use (BeautifulSoup4).
Let's try to get it quick and dirty by using the BeautifulSoup functions on the match variable so we don't pick up elements from the whole site:
# (1) let's try to find the h3 tag
title_tags = match.findAll("h3")       # use on the match variable
if len(title_tags) > 0:                # at least one found?
    title = title_tags[0].getText()    # get the text of the first one
    print("Title: ", title)            # show it
else:
    print("no h3-tags found")
    exit()

# (2) let's try to get some odds as numbers in the order in which they are displayed
odds_tags = match.findAll("span", attrs={"class": "event-odds"})
if len(odds_tags) > 2:                 # at least three found?
    odds = []                          # create a list
    for tag in odds_tags:              # loop over the odds_tags we found
        odd = tag.getText()            # get the text
        print("Odd: ", odd)
        # good, but it is a string; you can't compare it with a number in
        # Python and expect a good result.
        # You have to clean it and convert it:
        clean_odd = odd.strip()        # remove empty spaces
        odd = float(clean_odd)         # convert it to float
        print("Odd as Number:", odd)
else:
    print("something went wrong with the odds")
    exit()

input("Press enter to try it on the next match!")
I'm trying to pull prices from Binance's home page and BeautifulSoup returns empty elements for me. Binance's home page is at https://www.binance.com/en/, and the interesting block I'm trying to get text from is:
<div class="sc-62mpio-0-sc-iAyFgw iQwJlO" color="#999"><span>"/" "$" "35.49"</span></div>
On Binance's home page is a table and one of the columns is titled "Last Price". Next to the last price is the last USD price in a faded gray color and I'm trying to pull every one of those. Here's my code so far.
def grabPrices():
    page = requests.get("https://www.binance.com/en")
    soup = BeautifulSoup(page.text, "lxml")
    prices = soup.find_all("span", {"class": None})
    print(prices)
But the output is just a large array of "–" tags.
Selenium should be one way of scraping the table content you want from this Binance page. Google Selenium for its setup (pretty much: download a driver and place it on your local disk; if you are a Chrome user, see the ChromeDriver download page). Here is my code to access the content you are interested in:
from selenium import webdriver
from selenium.webdriver.support.ui import Select
from parsel import Selector  # scrapy.selector.Selector works the same way
import time

driver = webdriver.Chrome(executable_path=r'C:\chromedriver\chromedriver.exe')
time.sleep(3)  # Allow time to launch the controlled browser
driver.get('https://www.binance.com/en/')
time.sleep(3)  # Allow time to load the page

sel = Selector(text=driver.page_source)
Table = sel.xpath('//*[@id="__next"]/div/main/div[4]/div/div[2]/div/div[2]/div/div[2]/div')
Table.extract()  # This gives you all the content of the table
Then if you further process the entire table content with something like:
tb_rows = Table.xpath('.//div/a//div//div//span/text()').extract()
tb_rows  # this gives you the flat list of the rows' text values
At this point, the result is narrowed down to pretty much what you are interested in, but notice that the last price's two components (number and dollar price) are stored in two tags in the source page, so we can do the following to combine them and reach the destination:
for n in range(0, len(tb_rows), 2):
    LastPrice = tb_rows[n] + tb_rows[n+1]
    print(LastPrice)  # For sure, other than print, you could store each element in a list

driver.quit()  # don't forget to quit the driver at the end
The final output is the combined last-price string for each row.
I am trying to scrape information from this link: https://www.hopkinsguides.com/hopkins/index/Johns_Hopkins_ABX_Guide/Antibiotics
This site uses jQuery. My goal is to scrape all the antibiotic names, then for each antibiotic scrape "NON-FDA APPROVED USES", which is contained in a separate link. I hope I'm making sense.
The antibiotics are in categories that contain MANY other subcategories, which contain the rest of the antibiotics with their respective links.
My program first logs in and then clicks on the first 7 buttons to expand and show more categories. I used driver.find_element_by_xpath to expand the first layer, but I can't expand the second layer the same way (by looping through XPaths), because if I do, it ends up taking me to the other page where the "NON-FDA APPROVED USES" info is contained instead of expanding the page.
It does so because once you expand the first layer, the second layer contains more buttons/subcategories AND links that take you to the "NON-FDA APPROVED USES" page.
So if these are my XPaths:
#//*[@id="firstul"]/li[1]/a
#//*[@id="firstul"]/li[2]/a
li[1] could be a redirecting link,
li[2] could be a button that shows more links (which is what I want first).
I made a soup to separate the buttons from the links, but now I can't click on the "a" tags I printed out in the bottom for loop.
Any ideas on how I should go about this? Thanks in advance.
Here's my code:
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
import time
from random import randint
from bs4 import BeautifulSoup
#SIGN-IN
driver = webdriver.Chrome()
driver.get("http://www.hopkinsguide.com/home")
url = "https://www.hopkinsguides.com/hopkins/index/"
assert "Hopkins" in driver.title
sign_in_button = driver.find_element_by_xpath('//*[@id="logout"]')
sign_in_button.click()
user_elem = driver.find_element_by_name('username')
pass_elem = driver.find_element_by_id('dd-password')
user_elem.send_keys("user")
time.sleep(2)
pass_elem.send_keys("pass")
time.sleep(2)
sign_in_after_input = driver.find_element_by_xpath('//*[@id="dd-login-button"]')
sign_in_after_input.click()
def expand_page():
    req = driver.get("https://www.hopkinsguides.com/hopkins/index/Johns_Hopkins_ABX_Guide/Antibiotics")
    time.sleep(randint(2, 4))
    # expand first layer
    for i in range(1, 8):
        driver.find_element_by_xpath("//*[@id='firstul']/li[" + str(i) + "]/a").click()
        time.sleep(2)
    html = driver.page_source
    soup = BeautifulSoup(html, features='lxml')
    for i in soup.find_all('a'):
        if i.get('data-path') != None:
            print(i)
    time.sleep(2)
expand_page()
To expand all the values, this should work for you; it expands all the first-level values and keeps checking recursively whether any child values are expandable by looking at the element's role attribute:
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

def click_further(driver, elem):
    subs = WebDriverWait(driver, 5).until(lambda driver: elem.find_elements_by_xpath("./following-sibling::ul//li/a"))
    for sub in subs:
        if sub.get_attribute('role') == "button":
            sub.click()
            click_further(driver, sub)

for idx in range(1, 8):
    elem = WebDriverWait(driver, 5).until(EC.presence_of_element_located((By.XPATH, "//*[@id='firstul']/li[{}]/a".format(idx))))
    elem.click()
    click_further(driver, elem)
I guess then you can figure out how to get the text which you want to extract from it.
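For instance, after the tree is expanded you could read the entries through the same data-path attribute the question already filters on (a hedged sketch):
# every index entry's <a> carries a data-path attribute, so after expansion:
for link in driver.find_elements_by_xpath("//*[@id='firstul']//a[@data-path]"):
    print(link.text, link.get_attribute("data-path"))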
I suppose you want to expand all the expandable nodes first before accessing the underlying links one by one. From what I can see of the site, the discriminating attribute would be <li class="expandable index-expand"> and <li class="index-leaf">.
You can use Selenium to locate the "expandable index-expand" classes and click the nested <a> tag first. Then, repeat the same operation for the expanded child layer each time you click. Once you no longer detect "expandable index-expand" classes in the child layer, you can proceed to grab the links from "index-leaf".
find_elements_by_class_name should do the trick; a rough sketch of the whole flow follows below.
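A rough, untested sketch of that approach (the class names expandable index-expand and index-leaf come from the page source described above, the data-path attribute is the one the question already uses, the login steps from the question are omitted, and the fixed sleeps are deliberately crude):
import time
from selenium import webdriver

driver = webdriver.Chrome()
driver.get("https://www.hopkinsguides.com/hopkins/index/Johns_Hopkins_ABX_Guide/Antibiotics")

clicked = set()  # remember which expanders were already clicked
while True:
    expanders = driver.find_elements_by_css_selector("li.expandable.index-expand > a")
    fresh = [a for a in expanders if a.get_attribute("data-path") not in clicked]
    if not fresh:
        break  # nothing left to expand
    link = fresh[0]
    clicked.add(link.get_attribute("data-path"))
    link.click()
    time.sleep(1)  # crude wait; an explicit wait would be better

# once everything is expanded, collect the leaf links
for leaf in driver.find_elements_by_css_selector("li.index-leaf a"):
    print(leaf.text, leaf.get_attribute("href"))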