I'm trying to write a program that scrapes https://www.tcgplayer.com/ to get a list of Pokemon TCG prices based on a specified list
from lxml import etree, html
import requests
import string
def clean_text(element):
all_text = element.text_content()
cleaned = ' '.join(all_text.split())
return cleaned
page = requests.get("http://www.tcgplayer.com/product/231462/pokemon-first-partner-pack-pikachu?xid=pi731833d1-f2cc-4043-9551-4ca08506b43a&page=1&Language=English")
tree = html.fromstring(page.content)
price = tree.xpath("/html/body/div[2]/div/div/section[2]/section/div/div[2]/section[3]/div/section[1]/ul/li[1]/span[2]")
print(price)
However, when I am running this code the output ends up just being an empty list "[]"
I have tried using selenium and the browser function that it has, however I would like it to not need to open a browser for 100+ cards to get the price data. I have tested this code on another website url and xpath (https://www.pricecharting.com/game/pokemon-promo/jolteon-v-swsh183, /html/body/div[1]/div[2]/div/div/table/tbody[1]/tr[1]/td[4]/span[1]) - so I wonder if it is just how https://www.tcgplayer.com/ is built.
The expected return value is around $5
Question answered above by #Grismar:
When you test the XPath on a site, you probably do this in the Developer Console in the browser, after the page has loaded. At that point in time, any JavaScript will have already executed and completed and the page may have been updated or even been constructed from scratch by it. When using requests, it just loads the basic page and no scripts get executed - you'll need something that can execute JavaScript to get the same result, like selenium
BeautifulSoup scraping returns no data
Related
I want to export all store data from the following website into a excel-file:
https://www.ybpn.de/ihre-parfuemerien
The problem: The Map is "dynamic", so the needed data loads when you enter a postal code.
The data is need is stored in the div-class "storefinder__list-item" with a unique reference in the data-"storefinder-reference" div-class, example: data-storefinder-reference="132"
I tried:
soup.find("div", {"data-storefinder-reference": "132"})
But the output is: NONE
I think this problem is caused by the fact that the page is dynamic, so the needed data loads then, when you enter a postal code. So when I search for the reference id "132" its "there", but not loaded on the website and bs4 cant find this id.
Any ideas to improve the code?
For this you might need to look into tools like selenium and/or "firefox-headless".
Especially selenium allows you to "remote-control" web-pages with Python
Here is a tutorial: https://realpython.com/modern-web-automation-with-python-and-selenium/
If the problem is waiting for the page to load, you can do it with selenium.
`result = driver.execute_script('var text = document.title ; return text')`
If there is jquery on the page, it certainly does
result=driver.execute_script("""
$(document).ready({
var $text=$('yourselector').text()
return $text
})
""")
Note: For selenium you can look here
You could just open the page in chrome or ff, open the web debug console and query the elements. if you see them they are in the dom and thus queryable. But that will be done in javascript. if you‘re lucky they use jQuery.
I can't figure out why BS4 is not seeing the text inside of the span in the following scenario:
Page: https://pypi.org/project/requests/
Text I'm looking for - number of stars on the left hand side (around 43,000 at the time of writing)
My code:
stars = soup.find('span', {'class': 'github-repo-info__item', 'data-key': 'stargazers_count'}).text
also tried:
stars = soup.find('span', {'class': 'github-repo-info__item', 'data-key': 'stargazers_count'}).get_text()
Both return an empty string ''. The element itself seems to be located correctly (I can browse through parents / siblings in PyCharm debugger without a problem. Fetching text in other parts of the website also works perfectly fine. It's just the github-related stats that fail to fetch.
Any ideas?
Because this page use Javascript to load the page dynamically.So you couldn't get it directly by response.text
The source code of the page:
You could crawl the API directly:
import requests
r = requests.get('https://api.github.com/repos/psf/requests')
print(r.json()["stargazers_count"])
Result:
43010
Using bs4, we can't scrape stars rate.
After inspecting the site, please check response html.
There, there is class information named "github-repo-info__item", but there is no text information.
in this case, use selenium.
Set-up
I'm using scrapy to scrape housing ads.
For each ad, I'm trying to obtain info on year of construction.
This info is stated in most ads.
Problem
I can see the year of construction and the other info around it in the about section when I check the ad in the browser and its HTML code in developer mode.
However, when I use Scrapy I get returned an empty list. I can scrape other parts of the ad page (price, rooms, etc.), but not the about section.
Check this example ad.
If I use response.css('#caracteristique_bien').extract_first(), I get,
<div id="caracteristique_bien"></div>
That's as far as I can go. Any deeper returns emptiness.
How can I obtain the year of construction?
As I mentioned, this is rendered using javascript, which means that some parts of the html will be loaded dynamically by the browser (Scrapyis not a browser).
The good thing for this case is that the javascript is inside the actual request, which means you can still parse the information that information, but differently.
for example to get the description, you can find it inside:
import re
import demjson
script_info = response.xpath('//script[contains(., "Object.defineProperty")]/text()').extract_first()
# getting description
description_json = re.search("descriptionBien', (\{.+?\});", script_info, re.DOTALL)
real_description = demjson.decode(description_json)['value']
# getting surface area
surface_json = re.search("surfaceT', (\{.+?\})\);", script_info, re.DOTALL).group(1)
real_surface = demjson.decode(surface_json)['value']
...
As you can see script_info contains all the information, you just need to come up with a way to parse that to get what you want
But there is some information that isn't inside the same response. To get it you need to do a GET request to:
https://www.seloger.com/detail,json,caracteristique_bien.json?idannonce=139747359
As you can see, it only requires the idannonce, which you can get from the previous response with:
demjson.decode(re.search("idAnnonce', (\{.+?\})\);", script_info, re.DOTALL).group(1))['value']
Later with the second request, you can get for example the "construction year" with:
import json
...
[y for y in [x for x in json.loads(response.body)['categories'] if x['name'] == 'Général'][0]['criteria'] if 'construction' in y['value']][0]['value']
Loaded the page, opened devtools of the browser, and did a ctrl-F with the css selector you used (caracteristique_bien), and found out this request: https://www.seloger.com/detail,json,caracteristique_bien.json?idannonce=139747359
where you can find what you are looking for
Looking at your example, the add is loaded dynamically with javascript so you won't be able to get it via scrapy.
You can use Selenium for (massive) scraping (I did similar things on a famous french ads website)
Just use it headless with Chrome options and this will be fine :
from selenium import webdriver
options = webdriver.ChromeOptions()
options.add_argument('headless')
driver = webdriver.Chrome(options = options)
I am trying to create a python script to scrape the public county records website. I ultimately want to be able to have a list of owner names and the script run through all the names and pull the most recent deed of trust information (lender name and date filed). For the code below, I just wrote the owner name as a string 'ANCHOR EQUITIES LTD'.
I have used Selenium to automate the entering of owner name into form boxes but when the 'return' button is pressed and my results are shown, the website url does not change. I try to locate the specific text in the table using xpath but the path does not exist when I look for it. I have concluded the path does not exist because it is searching for the xpath on the first page with no results shown. BeautifulSoup4 wouldn't work in this situation because parsing the url would only return a blank search form html
See my code below:
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
browser = webdriver.Chrome()
browser.get('http://deed.co.travis.tx.us/ords/f?p=105:5:0::NO:::#results')
ownerName = browser.find_element_by_id("P5_GRANTOR_FULLNAME")
ownerName.send_keys('ANCHOR EQUITIES LTD')
docType = browser.find_element_by_id("P5_DOCUMENT_TYPE")
docType.send_keys("deed of trust")
ownerName.send_keys(Keys.RETURN)
print(browser.page_source)
#lenderName = browser.find_element_by_xpath("//*[#id=\"report_results\"]/tbody[2]/tr/td/table/tbody/tr[25]/td[9]/text()")
enter code here
I have commented out the variable that is giving me trouble.. Please help!!!!
If I am not explaining my problem correctly, please feel free to ask and I will clear up any questions.
I think you almost have it.
You match the element you are interested in using:
lenderNameElement = browser.find_element_by_xpath("//*[#id=\"report_results\"]/tbody[2]/tr/td/table/tbody/tr[25]/td[9]")
Next you access the text of that element:
lenderName = lenderNameElement.text
Or in a single step:
lenderName = browser.find_element_by_xpath("//*[#id=\"report_results\"]/tbody[2]/tr/td/table/tbody/tr[25]/td[9]").text
have you used following xpath?
//table[contains(#summary,"Search Results")]/tbody/tr
I have checked it's work perfect.In that, you have to iterate each tr
I've decided to take a swing at web scraping using Python (with lxml and requests). The webpage I'm trying to scrape to learn is: http://www.football-lineups.com/season/Real_Madrid/2013-2014
What I want to scrape is the table on the left of the webpage (the table with the scores and formations used). Here is the code I'm working with:
from lxml import html
import requests
page=requests.get("http://www.football-lineups.com/season/Real_Madrid/2013-2014")
tree=html.fromstring(page.text)
competition=tree.xpath('//*[#id="sptf"]/table/tbody/tr[2]/td[4]/font/text()')
print competition
The xpath that I input is the xpath that I copied over from Chrome. The code should normally return the competition of the first match in the table (i.e. La Liga). In other words, it should return the second row, fourth column entry (there is a random second column on the web layout, I don't know why). However, when I run the code, I get back an empty list. Where might this code be going wrong?
If you inspect the row source of the page you will see that the lineup table is not there.
It is fed after loading the page using AJAX so you wont be able to fetch it only by getting http://www.football-lineups.com/season/Real_Madrid/2013-2014 since the JS won't be interpreted and thus the AJAX not executed.
The AJAX request is the following:
URL: http://www.football-lineups.com/ajax/get_sectf.php
method: POST
data: d1=3&d2=-2013&d3=0&d4=1&d5=0&d6=1&d7=20&d8=0&d9=&d10=0&d11=0&d12=undefined
Maybe you can forge the request to get this data. I'll let you analyse what are those well named dX arguments :)
Here, I give full code which fulfill your requirement:
from selenium import webdriver
import csv
url="http://www.football-lineups.com/season/Real_Madrid/2013-2014"
driver=webdriver.Chrome('./chromedriver.exe')
driver.get(url)
myfile = open('demo.csv', 'wb')
wr = csv.writer(myfile, quoting=csv.QUOTE_ALL)
tr_list=driver.find_elements_by_xpath("//span[#id='sptf']/table/tbody/tr")
for tr in tr_list:
lst=[]
for td in tr.find_elements_by_tag_name('td'):
lst.append(td.text)
wr.writerow(lst)
driver.quit()
myfile.close()