Python, Scrapy problems when parsing tables in Earning Reports - python

I am trying to parse some data from the table (the balance sheet) at the bottom of every earnings report. I use AMD as an example here, but the problem is not limited to AMD.
Here is the link
The problem is that I cannot get any reading: my spider always returns an EMPTY result. I used scrapy shell "http://example.com" to test my XPath, which I copied directly from the Google Chrome Inspector, and it still didn't work.
Here is my XPath (as provided by Chrome):
//*[@id="newsroom-copy"]/div[2]/div[8]/table/tbody/tr[9]/td[4]/text()
Here is my code:
import scrapy

class ESItem(scrapy.Item):
    Rev = scrapy.Field()

class ESSpider(scrapy.Spider):
    name = "es"
    start_urls = [
        'http://www.marketwired.com/press-release/amd-reports-2016-second-quarter-results-nasdaq-amd-2144535.htm',
    ]

    def parse(self, response):
        item = ESItem()
        for earning in response.xpath('//*[@id="newsroom-copy"]/div[2]/div[8]/table/tbody'):
            item['Rev'] = earning.xpath('tr[9]/td[4]/text()').extract_first()
            yield item
I want to retrieve the "revenue numbers" from the table at the bottom of the report.
Thanks!
I run my code by using this command:
scrapy runspider ***.py -o ***.json
The code runs fine, with no errors; it just doesn't return what I am looking for.
UPDATE: I have figured something out: I have to remove the "tbody" tag from the XPath, which I don't understand. Can anyone explain this a little bit, please?

The HTML shown by Chrome's inspect tool is the browser's interpretation of the markup the server actually sent, not the markup itself.
The tbody tag is a prime example. If you view the page source of a website you may see a structure like this:
<table>
  <tr>
    <td></td>
  </tr>
</table>
Now if you inspect the same page, this is what you get:
<table>
  <tbody>
    <tr>
      <td></td>
    </tr>
  </tbody>
</table>
Scrapy receives the page source, not what the inspector shows, so whenever you select something in a page, make sure it exists in the page source.
Another example is an element that is generated by JavaScript while the page loads. Scrapy won't get this either, so you'll need something that can render it, such as scrapy-splash or Selenium.
As a side note, take the time to learn XPath and CSS selectors. It saves a lot of time when you know how to query elements just right.
//*[@id='newsroom-copy']/div[2]/div[8]/table/tr[9]/td[4]/text()
selects the same cell on this page as
//table/tr[td/text()='Net revenue']/td[4]/text()
See how much nicer it looks?
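The difference between the inspector's path and the page source can be reproduced offline. The sketch below is only an illustration: it uses lxml (the parser Scrapy's selectors are built on) against a made-up snippet that stands in for the raw source, and the cell values are placeholders, not figures from the report.

```python
from lxml import html

# Hypothetical raw page source. Note that it contains no <tbody>,
# and the numbers are placeholders rather than real report values.
RAW_SOURCE = """
<div id="newsroom-copy">
  <table>
    <tr><td>Net revenue</td><td>100</td><td>200</td><td>300</td></tr>
    <tr><td>Gross margin</td><td>10</td><td>20</td><td>30</td></tr>
  </table>
</div>
"""

tree = html.fromstring(RAW_SOURCE)

# Path as copied from the browser inspector (includes tbody): matches nothing.
browser_style = tree.xpath('//*[@id="newsroom-copy"]/table/tbody/tr[1]/td[4]/text()')

# Same path without tbody: matches the cell in the raw source.
source_style = tree.xpath('//*[@id="newsroom-copy"]/table/tr[1]/td[4]/text()')

# Label-based path: finds the row by its first cell instead of by position.
label_based = tree.xpath('//table/tr[td/text()="Net revenue"]/td[4]/text()')

print(browser_style, source_style, label_based)
```

The label-based form keeps working if rows are added above the one you want, which is the main reason to prefer it over positional indexes.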

Related

How to get the download link of html a tag which has no explicit true link?

I encountered a web page that has many download icons.
If I click one of these icons, the browser starts downloading a zip file.
However, these download icons seem to be plain images, with no explicit download link that can be copied.
I looked into the HTML source and found that each download icon belongs to a tr block like the one below.
<tr>
<td title="aaa">
<span class="centerFile">
<img src="/images/downloadCenter/pic.png" />
</span>aaa
</td>
<td>2021-09-10 13:42</td>
<td>bbb</td>
<td><a id="4099823" data='{"clazzId":37675171,"libraryId":"98689831","relationId":1280730}' recordid="4099823" target="_blank" class="download_ic checkSafe" style="line-height:54px;"><img src="/images/down.png" /></a></td>
</tr>
Clicking this link downloads a zip file.
So my problem is how to get the download links behind these icons without actually clicking them in the browser. In particular, I want to know how to do this in Python by analyzing the HTML source, so that I can batch-download the files.
If you want to batch-download those files and cannot find the links by analyzing the HTML and JavaScript (probably because a JavaScript function builds the link, or makes a call to the backend), you can use Selenium to simulate a user.
You will need something like the code below. It uses the class names from the HTML you posted, which is where I think the JavaScript download function is attached:
from selenium import webdriver
from selenium.webdriver.common.by import By

driver = webdriver.Chrome()
# URL of website
url = "https://www.yourwebsitelinkhere.com/"
driver.get(url)
# use the class names to find the anchor links
# (the find_elements_by_* helpers were removed in recent Selenium 4 releases)
download_links = driver.find_elements(By.CSS_SELECTOR, ".download_ic.checkSafe")
for link in download_links:
    link.click()
Example of how this works for Stack Overflow (as of the day of writing this answer):
from selenium import webdriver
from selenium.webdriver.common.by import By

driver = webdriver.Chrome()
driver.get("https://stackoverflow.com")
elements = driver.find_elements(By.CSS_SELECTOR, '.-marketing-link.js-gps-track')
elements[0].click()
This should lead you to the Stack Overflow "about" page.
[EDIT] Answer edited, as it seems compound class names are not supported by Selenium's class-name lookup; example for Stack Overflow added.
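Before reaching for a browser, it can be worth collecting the identifiers that each anchor already carries in its data attribute. The sketch below uses only the standard library; ROW_HTML is a trimmed copy of the row from the question and DownloadAnchorParser is a hypothetical helper name. How the site's JavaScript turns these fields into a download URL is not visible in the question, so that step would still have to be found, for example by watching the request the browser sends when an icon is clicked.

```python
import json
from html.parser import HTMLParser

# Trimmed copy of the table row shown in the question.
ROW_HTML = """
<tr>
  <td><a id="4099823" data='{"clazzId":37675171,"libraryId":"98689831","relationId":1280730}'
         recordid="4099823" target="_blank" class="download_ic checkSafe"><img src="/images/down.png" /></a></td>
</tr>
"""


class DownloadAnchorParser(HTMLParser):
    """Collect the JSON payload of every <a> tag that has the download_ic class."""

    def __init__(self):
        super().__init__()
        self.records = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        classes = (attrs.get("class") or "").split()
        if tag == "a" and "download_ic" in classes:
            # The data attribute holds a JSON object; decode it into a dict.
            payload = json.loads(attrs.get("data") or "{}")
            payload["recordid"] = attrs.get("recordid")
            self.records.append(payload)


parser = DownloadAnchorParser()
parser.feed(ROW_HTML)
records = parser.records
print(records)
```

If the download request turns out to be a plain HTTP call built from these fields, the batch download can then be done without Selenium; if not, the Selenium approach above remains the fallback.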

Python selenium find element by xpath multiple conditions

I use Python Selenium for web scraping. I would like to collect the links that have both a specific date (like [01-20]) and a title containing specific text (like 'public'). How can the code satisfy both conditions?
I tried the following, but had no luck.
Thank you in advance!!
href:
<td width="89%" height="26">
sth sth public
</td>
<td width="8%" align="center">[01-20]</td>
<tr>
code:
titles = driver.find_elements_by_css_selector("[title*='public']")
for title in titles:
    links = [title.get_attribute('href') for title in driver.find_elements_by_xpath("//td[text()='[01-20]']/preceding::td[1]/a")]
urls = [links.get_attribute("href") for links in driver.find_elements_by_css_selector("[title*='public']")]
for url in urls:
    print(url)
    driver.get(url)
    ###do something
Express both conditions in one XPath: a predicate on the date cell, and the contains() function on the link's title:
'//td[text()="[01-20]"]/preceding::td[1]/a[contains(@title, "资本")]'
check this video for more info
EDIT: changed xpath to a working answer
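The combined XPath can be tried without a browser. This is a minimal sketch using lxml on a made-up table; the a tags, titles, and hrefs are assumptions for illustration, since the anchor markup is not visible in the question, and "public" is used as the search text.

```python
from lxml import html

# Hypothetical listing: the links, titles, and hrefs are invented for this example.
SAMPLE = """
<table>
  <tr><td><a title="sth sth public" href="/a.html">sth sth public</a></td><td>[01-20]</td></tr>
  <tr><td><a title="something else" href="/b.html">something else</a></td><td>[01-20]</td></tr>
  <tr><td><a title="public notice" href="/c.html">public notice</a></td><td>[01-21]</td></tr>
</table>
"""

tree = html.fromstring(SAMPLE)

# Condition 1: the date cell equals "[01-20]".
# Condition 2: the link in the cell just before it has "public" in its title.
hrefs = tree.xpath(
    '//td[text()="[01-20]"]/preceding::td[1]/a[contains(@title, "public")]/@href'
)
print(hrefs)
```

Only the first row satisfies both conditions: the second has the right date but the wrong title, and the third has the right title but the wrong date.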

Replace scrapy response.body with selenium response

I am trying to crawl the following product page of an online shop with Scrapy: https://www.mediamarkt.de/de/product/_lg-65uk6470plc-2391592.html
The product's properties are listed in a normal HTML table, and some of them are shown only after the "Alle Details einblenden" ("Show all details") button is clicked.
The properties are saved in a JS variable and are preloaded from the beginning. Pressing the button makes a JS function add the rest of the properties to the table.
Now I want to get the full content of the web page and then crawl it completely.
Because I need to use Scrapy's SitemapSpider, I decided to use Selenium to load the page, simulate clicking the button, and then replace Scrapy's response.body with the full content. Afterwards, when the data is parsed, Scrapy should parse the new properties from the table too. But it doesn't work and I really don't know why. The properties that are shown from the beginning are parsed successfully.
chromeDriver = webdriver.Chrome('C:/***/***/chromedriver.exe') #only for testing

def parse(self,response):
    chromeDriver.get(response.url)
    moreContentButton = chromeDriver.find_element_by_xpath('//div[@class="mms-product-features__more"]/span[@class="mms-link underline"]')
    chromeDriver.execute_script('arguments[0].click();', moreContentButton)
    newHTMLBody = chromeDriver.page_source.encode('utf-8')
    response._set_body(newHTMLBody)
    scrapyProductLoader = ItemLoader(item=Product(), response=response)
    scrapyProductLoader.add_xpath('propertiesKeys', '//tr[@class="mms-feature-list__row"]/th[@class="mms-feature-list__dt"]')
    scrapyProductLoader.add_xpath('propertiesValues', '//tr[@class="mms-feature-list__row"]/td[@class="mms-feature-list__dd"]')
I tried the response.replace(body=chromeDriver.page_source) method instead of response._set_body(newHTMLBody), but that didn't work either. It changes nothing. I know that response.body contains all properties of the product (I checked by writing response.body to an HTML file), but Scrapy only picks up the properties that were present before the button was clicked (in this example, Betriebssystem: webOS 4.0 (AI ThinQ) is the last entry).
But I need all properties.
Here is a part of the response.body before the ItemLoader was initialized:
<tr class="mms-feature-list__row"><th scope="row" class="mms-feature-list__dt">Betriebssystem</th>
<td class="mms-feature-list__dd">webOS 4.0 (AI ThinQ)</td></tr>
<tr class="mms-feature-list__row"><th scope="row" class="mms-feature-list__dt">Prozessor</th>
<td class="mms-feature-list__dd">Quad Core-Prozessor</td></tr><tr class="mms-feature-list__row">
<th scope="row" class="mms-feature-list__dt">Energieeffizienzklasse</th>
<td class="mms-feature-list__dd">A</td></tr>
</tbody></table></div>
<div class="mms-feature-list mms-feature-list--rich">
<h3 class="mms-headline">Bild</h3>
<table class="mms-feature-list__container">
<tbody><tr class="mms-feature-list__row"><th scope="row" class="mms-feature-list__dt">Bildschirmauflösung</th>
<td class="mms-feature-list__dd">3.840 x 2.160 Pixel</td></tr>
<tr class="mms-feature-list__row"><th scope="row" class="mms-feature-list__dt">Bildwiederholungsfrequenz</th>
<td class="mms-feature-list__dd">True Motion 100</td></tr>
Thanks for your attention and your help.
You don't need selenium or anything else to get the desired data from the mentioned page.
import json
text_data = response.css('script').re(r'window\.__PRELOADED_STATE__ = (.+);')[0]
# This dict will contain everything you need.
data = json.loads(text_data)
Selenium is a testing tool. Avoid using it for scraping.
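The same extraction can be tried outside a spider. The sketch below uses only the standard library on a made-up page; the "features" key and the shape of the JSON are assumptions for illustration, and the real structure of window.__PRELOADED_STATE__ has to be inspected on the actual page.

```python
import json
import re

# Hypothetical page source. The layout of the JSON object is invented here.
PAGE_SOURCE = """
<html><body>
<script>window.__PRELOADED_STATE__ = {"features": [{"name": "Betriebssystem", "value": "webOS 4.0 (AI ThinQ)"}, {"name": "Prozessor", "value": "Quad Core-Prozessor"}]};</script>
</body></html>
"""

# Capture the object literal assigned to window.__PRELOADED_STATE__,
# stopping at the ";" that directly precedes the closing script tag.
match = re.search(
    r"window\.__PRELOADED_STATE__\s*=\s*(\{.*?\});\s*</script>",
    PAGE_SOURCE,
    re.DOTALL,
)
state = json.loads(match.group(1))

# Flatten the (assumed) feature list into a name -> value mapping.
features = {entry["name"]: entry["value"] for entry in state["features"]}
print(features)
```

Because the properties are preloaded in the page source, this avoids both the button click and the browser.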
You could probably do this:
>>> from scrapy.http import HtmlResponse
>>> response = HtmlResponse(url="Any URL HERE", body=BODY_STRING_HERE, encoding='utf-8')
>>> response.xpath('xpath_here').extract()

Python BeautifulSoup find span inside class

I am trying to create a Python script that finds a specific text inside a span that belongs to a class. Unfortunately I keep getting an empty response or "None".
It comes from a very specific page, so I'll paste the small part of it that I'm trying to search:
<tbody>
<tr class="zone-dedicated-availability" data-actions="refUnavailable" data-dc="" data-ref="160sk5" data-availability="3600-">
<td class="show-on-ref-unavailable elapsed-time-since-last-delivery" colspan="5">
<span qtlid="47402">
Last server delivered: today at 01:59.
</span><br><a style="font-size:14px;" href=".." qtlid="50602">Go for a VPS-CLOUD<br><span style="font-size:0.9em;" qtlid="50615">(from £5.99 excl.VAT)</span></a>
</td>
I am trying to get the "Last server delivered" text with my script. I am still learning, so I would appreciate the help:
import requests
from bs4 import BeautifulSoup

page = requests.get('...')
tree = page.content
soup = BeautifulSoup(tree, 'html.parser')
table = soup.find('tbody', {'class': 'zone-dedicated-availability'})
print(table)
I am probably missing something in the find statement, as this is where I'm stuck now. I tried a few different things, but I'm not sure how to get the output I need.
The class attribute is in tr so you need to use this:
table = soup.find('tbody').find('tr', {'class': 'zone-dedicated-availability'})
or even better:
table = soup.find('tr', {'class': 'zone-dedicated-availability'})
You can also use a CSS selector and the select method:
soup.select('tbody tr.zone-dedicated-availability')
The data you want is in the first span, the one with qtlid="47402", thus:
In [19]: soup.find('tr', class_='zone-dedicated-availability').find('span', qtlid='47402').get_text(strip=True)
Out[19]: 'Last server delivered: today at 01:59.'
Have you tried looking for a table row with the class of "zone-dedicated-availability"? It seems that you are currently searching for a table body with that class, and that it is unable to find it.

Why does this xpath fail using lxml in python?

Here is an example web page I am trying to get data from.
http://www.makospearguns.com/product-p/mcffgb.htm
The xpath was taken from chrome development tools, and firepath in firefox is also able to find it, but using lxml it just returns an empty list for 'text'.
from lxml import html
import requests
site_url = 'http://www.makospearguns.com/product-p/mcffgb.htm'
xpath = '//*[@id="v65-product-parent"]/tbody/tr[2]/td[2]/table[1]/tbody/tr/td/table/tbody/tr[2]/td[2]/table/tbody/tr[1]/td[1]/div/table/tbody/tr/td/font/div/b/span/text()'
page = requests.get(site_url)
tree = html.fromstring(page.text)
text = tree.xpath(xpath)
Printing out the tree text with
print(tree.text_content().encode('utf-8'))
shows that the data is there, but it seems the xpath isn't working to find it. Is there something I am missing? Most other sites I have tried work fine using lxml and the xpath taken from chrome dev tools, but a few I have found give empty lists.
1. Browsers frequently change the HTML
Browsers quite frequently change the HTML served to them to make it "valid". For example, if you serve a browser this invalid HTML:
<table>
<p>bad paragraph</p>
<tr><td>Note that cells and rows can be unclosed (and valid) in HTML
</table>
To render it, the browser is helpful and tries to make it valid HTML and may convert this to:
<p>bad paragraph</p>
<table>
<tbody>
<tr>
<td>Note that cells and rows can be unclosed (and valid) in HTML</td>
</tr>
</tbody>
</table>
The above is changed because <p>aragraphs cannot be inside <table>s and <tbody>s are recommended. What changes are applied to the source can vary wildly by browser. Some will put invalid elements before tables, some after, some inside cells, etc...
2. Xpaths aren't fixed, they are flexible in pointing to elements.
Using this 'fixed' HTML:
<p>bad paragraph</p>
<table>
<tbody>
<tr>
<td>Note that cells and rows can be unclosed (and valid) in HTML</td>
</tr>
</tbody>
</table>
If we try to target the text of <td> cell, all of the following will give you approximately the right information:
//td
//tr/td
//tbody/tr/td
/table/tbody/tr/td
/table//*/text()
And the list goes on...
However, a browser will generally give you the most precise (and least flexible) XPath, one that lists every element from the DOM. In this case (XPath positions start at 1):
/table[1]/tbody[1]/tr[1]/td[1]/text()
3. Conclusion: Browser-given XPaths are usually unhelpful
This is why the XPaths produced by developer tools will frequently fail when applied to the raw HTML.
The solution: always refer to the raw HTML and use a flexible but precise XPath.
Examine the actual HTML that holds the price:
<table border="0" cellspacing="0" cellpadding="0">
<tr>
<td>
<font class="pricecolor colors_productprice">
<div class="product_productprice">
<b>
<font class="text colors_text">Price:</font>
<span itemprop="price">$149.95</span>
</b>
</div>
</font>
<br/>
<input type="image" src="/v/vspfiles/templates/MAKO/images/buttons/btn_updateprice.gif" name="btnupdateprice" alt="Update Price" border="0"/>
</td>
</tr>
</table>
If you want the price, there is actually only one place to look!
//span[@itemprop="price"]/text()
And this will return:
$149.95
The xpath is simply wrong
Here is snippet from the page:
<form id="vCSS_mainform" method="post" name="MainForm" action="/ProductDetails.asp?ProductCode=MCFFGB" onsubmit="javascript:return QtyEnabledAddToCart_SuppressFormIE();">
<img src="/v/vspfiles/templates/MAKO/images/clear1x1.gif" width="5" height="5" alt="" /><br />
<table width="100%" cellpadding="0" cellspacing="0" border="0" id="v65-product-parent">
<tr>
<td colspan="2" class="vCSS_breadcrumb_td"><b>
Home >
You can see that the element with id "v65-product-parent" is of type table and has a subelement tr.
There can be only one element with that id (otherwise the document would be broken).
The XPath expects tbody as a child of that element (table), and there is none in the whole page.
This can be tested by
>>> "tbody" in page.text
False
How did Chrome come up with that XPath?
If you simply download this page with
$ wget http://www.makospearguns.com/product-p/mcffgb.htm
and review its content, it does not contain a single element named tbody.
But if you use Chrome Developer Tools, you find some.
How do they get there?
This often happens when JavaScript comes into play and generates some page content in the browser. But as LegoStormtroopr noted, that is not the case here: this time it is the browser that modifies the document to make it correct.
How to get the content of a page dynamically modified within the browser?
You have to give some sort of browser a chance. E.g. if you use Selenium, you will get it.
byselenium.py
from selenium import webdriver
from lxml import html

url = "http://www.makospearguns.com/product-p/mcffgb.htm"
xpath = '//*[@id="v65-product-parent"]/tbody/tr[2]/td[2]/table[1]/tbody/tr/td/table/tbody/tr[2]/td[2]/table/tbody/tr[1]/td[1]/div/table/tbody/tr/td/font/div/b/span/text()'
browser = webdriver.Firefox()
browser.get(url)
html_source = browser.page_source
print("test tbody", "tbody" in html_source)
tree = html.fromstring(html_source)
text = tree.xpath(xpath)
print(text)
which prints
$ python byselenium.py
test tbody True
['$149.95']
Conclusions
Selenium is great when it comes to changes made within the browser. However, it is a rather heavy tool, and if you can do it a simpler way, do it that way. LegoStormtroopr has proposed such a simpler solution that works on the plainly fetched web page.
I had a similar issue (Chrome inserting tbody elements when you do Copy as XPath). As others answered, you have to look at the actual page source, though the browser-given XPath is a good place to start. I've found that often, removing tbody tags fixes it, and to test this I wrote a small Python utility script to test XPaths:
#!/usr/bin/env python
import sys, requests
from lxml import html

if (len(sys.argv) < 3):
    print 'Usage: ' + sys.argv[0] + ' url xpath'
    sys.exit(1)
else:
    url = sys.argv[1]
    xp = sys.argv[2]

page = requests.get(url)
tree = html.fromstring(page.text)
nodes = tree.xpath(xp)

if (len(nodes) == 0):
    print 'XPath did not match any nodes'
else:
    # tree.xpath(xp) produces a list, so always just take first item
    print (nodes[0]).text_content().encode('ascii', 'ignore')
(that's Python 2.7, in case the non-function "print" didn't give it away)
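The "remove the tbody steps" fix can also be automated before an XPath copied from the browser is tried against the raw source. The helper below is a hypothetical sketch (strip_tbody is not part of lxml or Scrapy), written for Python 3 with the standard library only. It is only appropriate when the page source really has no tbody elements, which can be checked with "tbody" in page.text as shown earlier.

```python
import re


def strip_tbody(xpath: str) -> str:
    """Remove every tbody location step (with an optional predicate) from an XPath."""
    # Match "/tbody" or "/tbody[...]" only when it is a complete step,
    # i.e. followed by another "/" or by the end of the expression.
    return re.sub(r"/tbody(\[[^\]]*\])?(?=/|$)", "", xpath)


copied = '//*[@id="newsroom-copy"]/div[2]/div[8]/table/tbody/tr[9]/td[4]/text()'
cleaned = strip_tbody(copied)
print(cleaned)
```

The cleaned expression still depends on positional indexes, so it is a quick check rather than a replacement for writing a shorter, attribute-based XPath.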
