Python Requests Module Doesn't Fetch All The Elements It Ought To

This piece of code works properly
from lxml import html
import requests
page = requests.get(c)
tree = html.fromstring(page.content)
link = tree.xpath('//script/text()')
But it doesn't fetch all of the content, as if part of it were hidden.
I can see this is the case because the next thing I do is this:
print len(link)
and it returns nine (9).
I then go to the page (the string c in the code above), open its source in Firefox (view-source:), hit Ctrl+F, and search for <script with a space at the end.
That gives thirty-three (33) matches, and the one I want is not among the ones fetched.
What's happening? Am I being blocked somehow? How can I get around this and make the requests module see what Firefox is seeing?

If you try
tree.xpath('//script')
you should get 33 matches.
Your original XPath, //script/text(), only selects text nodes, and on your page only nine script elements actually contain something between the opening and closing tags; the others are empty (for example, scripts that only reference an external file through a src attribute) and therefore contribute no text node.
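To see the difference concretely, here is a minimal sketch that reuses the variables from the question (c is the URL string); an external script such as <script src="..."></script> matches //script but contributes no text node:
from lxml import html
import requests
page = requests.get(c)  # c is the URL string from the question
tree = html.fromstring(page.content)
all_scripts = tree.xpath('//script')            # every script element (33 on this page)
inline_scripts = tree.xpath('//script/text()')  # only the text nodes (9 on this page)
print(len(all_scripts))
print(len(inline_scripts))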

Related

I have created a list using find_all in Beautiful Soup based on an attribute. How do I return the node I want?

I have an MS Word document template that has structured document tags, including repeating sections. I am using a Python script to pull out the important parts and send them to a dataframe. My script works as intended on about 80% of the documents I have tried, but the rest fail. The issue is in how I find the first repeating section; I have been doing the following:
from bs4 import BeautifulSoup as BS
soup = BS(f, 'xml')  # f is the entire XML file
soupdocument = soup.document  # document is the only child node of soup
soupbody = soupdocument.body  # body is the only child node of document
ODR = soupbody.contents[5]
which often works; however, some users have managed to hit Enter in places in the document that are not locked down, which shifts the index. I know the issue would be better solved by not hard-coding the fifth element of soupbody.
soupbody.find_all('tag')
<w:tag w:val="First Name"/>,
<w:tag w:val="Last Name"/>,
<w:tag w:val="Position"/>,
<w:tag w:val="Phone Number"/>,
<w:tag w:val="Email"/>,
<w:tag w:val="ODR Repeating Section"/>,
The above is a partial list of what is returned; the actual list contains several dozen tags, and some are repeated. The section I want is the last one listed above, and it is usually, but not always, found by the first code block. I believe I can pass something like find_all({tag: SOMETHING}), and I have tried cutting and pasting different parts of "ODR Repeating Section", but it doesn't work. What is the correct way to find this section?
Hi, perhaps specify the attribute value you're searching for in addition to the tag name:
tags = soup.findAll('tag', {'val': 'ODR Repeating Section'})
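If matching on the namespaced attribute key proves fiddly, a function filter sidesteps the question of whether the parser stores it as val or w:val. The following is only a sketch along those lines; template.xml and the variable names are placeholders, and it assumes the usual WordprocessingML layout in which the w:tag marker sits inside the w:sdt element that wraps the repeating section:
from bs4 import BeautifulSoup as BS

with open('template.xml') as f:  # placeholder file name
    soup = BS(f, 'xml')

# Match the w:tag marker by its value, whichever key the parser stores the attribute under.
def is_odr_marker(el):
    return el.name == 'tag' and 'ODR Repeating Section' in el.attrs.values()

marker = soup.find(is_odr_marker)
# The repeating section itself is the enclosing structured document tag (w:sdt).
odr_section = marker.find_parent('sdt') if marker else None
From odr_section you can then walk to the repeating rows without relying on a hard-coded index like contents[5].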

xpath: how to format the path

I would like to get the @src value '/pol_il_DECK-SANTA-CRUZ-STAR-WARS-EMPIRE-STRIKES-BACK-POSTER-8-25-20135.jpg' from this webpage:
from lxml import html
import requests
URL = 'http://systemsklep.pl/pol_m_Kategorie_Deskorolka_Deski-281.html'
session = requests.session()
page = session.get(URL)
HTMLn = html.fromstring(page.content)
print HTMLn.xpath('//html/body/div[1]/div/div/div[3]/div[19]/div/a[2]/div/div/img/@src')[0]
but I can't. No matter how I format the XPath, it doesn't work.
In the spirit of @pmuntima's answer, if you already know it's the 14th sourced image but want to stay with lxml, then you can:
print HTMLn.xpath('//img/@data-src')[14]
To get that particular image. It similarly reports:
/pol_il_DECK-SANTA-CRUZ-STAR-WARS-EMPIRE-STRIKES-BACK-POSTER-8-25-20135.jpg
If you want to do your indexing in XPath (possibly more efficient in very large result sets), then:
print HTMLn.xpath('(//img/@data-src)[14]')[0]
It's a little bit uglier, given the need to parenthesize in the XPath, and then to index out the first element of the list that .xpath always returns.
Still, as discussed in the comments above, strictly numerical indexing is generally a fragile scraping pattern.
Update: So why does the XPath given by the browser's inspect tools not lead to the right element? Because the content a browser sees, after a dynamic JavaScript-based update, is different from the content your request sees. Your request is not running JS and performs no such updates. Different content means a different address is needed, at least when the address is as static and fragile as this one.
Part of the updates here seem to be taking src URIs, which initially point to an "I'm loading!" GIF, and replacing them with the "real" src values, which are found in the data-src attribute to begin with.
So you need two changes:
a stronger way to address the content you want (one that doesn't break when you move from browser inspection to a programmatic fetch), and
to fetch the URIs you want from data-src rather than src, because in your programmatic fetch the JS has not done its load-and-switch trick the way it did in the browser.
If you know text associated with the target image, that can be the trick. E.g.:
search_phrase = 'DECK SANTA CRUZ STAR WARS EMPIRE STRIKES BACK POSTER'
path = '//img[contains(@alt, "{}")]/@data-src'.format(search_phrase)
print HTMLn.xpath(path)[0]
This works because the alt attribute contains the target text. You look for images that have the search phrase contained in their alt attributes, then fetch the corresponding data-src values.
I used a combination of the requests and Beautiful Soup libraries. They are both wonderful and I would recommend them for scraping and parsing/extracting HTML. If you have a complex scraping job, Scrapy is really good.
So for your specific example, I can do
from bs4 import BeautifulSoup
import requests
URL = 'http://systemsklep.pl/pol_m_Kategorie_Deskorolka_Deski-281.html'
r = requests.get(URL)
soup = BeautifulSoup(r.text, "html.parser")
specific_element = soup.find_all('a', class_="product-icon")[14]
res = specific_element.find('img')["data-src"]
print(res)
It will print out
/pol_il_DECK-SANTA-CRUZ-STAR-WARS-EMPIRE-STRIKES-BACK-POSTER-8-25-20135.jpg

why do the results of response.xpath('//html') differ from response.body?

I'm trying to parse this page using scrapy
http://mobileshop.ae/one-x
I need to extract the links of the products.
The problem is that the links are available in the response.body result, but not available if you try response.xpath('//body').extract().
The results of response.body and response.xpath('//body') are different:
>>> body = response.body
>>> body_2 = response.xpath('//html').extract()[0]
>>> len(body)
238731
>>> len(body_2)
67520
The same short result comes from response.xpath('.').extract()[0].
Any idea why this happens, and how can I extract the data I need?
So, the issue here is a lot of malformed content in that page, including several unclosed tags. One way to solve this problem is to use lxml's soupparser to parse the malformed content (it uses BeautifulSoup under the covers) and build a Scrapy Selector with it.
Example session with scrapy shell http://mobileshop.ae/one-x:
>>> from lxml.html import soupparser
>>> from scrapy import Selector
>>> sel = Selector(_root=soupparser.fromstring(response.body))
>>> sel.xpath('//h4[@class="name"]/a').extract()
[u'HTC One X 3G 16GB Grey',
u'HTC One X 3G 16GB White',
u'HTC One X 3G 32GB Grey',
u'HTC One X 3G 32GB White']
Note that using the BeautifulSoup parser is a lot slower than lxml's default parser. You probably want to do this only in the places where it's really needed.
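If what you ultimately need is the product URLs themselves, the same selector can be pointed at the href attribute. A minimal sketch continuing the shell session above, on the assumption that the product links live on those same anchors:
>>> product_links = sel.xpath('//h4[@class="name"]/a/@href').extract()
Each entry is then the URL of one product page, relative or absolute depending on how the site writes its links.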
response.xpath("//body") returns the body element of the HTML contained in the response, while response.body returns the body (the message-body) of the whole HTTP response, i.e. all of the HTML in the response, including both the head and body elements.
response.xpath("//body") is really a shortcut that converts the body of the HTTP response into a Selector object that can be navigated with XPath.
The links you need are contained in the HTML body element; they can't really be anywhere else, so I'm not sure why you suggest they are not there. response.xpath("//body//a/@href") will give you all the links on the page; you probably need to write a more specific XPath that selects only the links you want.
The lengths you mention result from what is being measured. len(response.xpath('//body').extract()) returns the number of body elements in the document, because response.xpath(...).extract() returns a list of the elements matching the XPath, and there is only one body in the document. In your second example, len(response.xpath('//body').extract()[0]), you are getting the body element as a string and counting the number of characters it contains. len(response.body) also gives a number of characters, but over the whole HTTP response; that number is higher most likely because the HTML head contains lots of scripts and stylesheets that are not present in the HTML body.
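In shell terms, the three quantities being compared look like this (a sketch continuing the same scrapy shell session):
>>> len(response.xpath('//body').extract())     # how many body elements matched (just one)
>>> len(response.xpath('//body').extract()[0])  # characters in that body element, taken as a string
>>> len(response.body)                          # characters in the entire HTTP response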

Finding urls containing a specific string

I haven't used RegEx before, and everyone seems to agree that it's bad for web scraping and HTML in particular, but I'm not really sure how to solve my little challenge without it.
I have a small Python scraper that opens 24 different webpages. In each webpage, there's links to other webpages. I want to make a simple solution that gets the links that I need and even though the webpages are somewhat similar, the links that I want are not.
The only common thing between the URLs seems to be a specific string: 'uge' or 'Uge' (uge means week in Danish, and the week number changes every week, duh). It's not as if the URLs have a common ID or something like that which I could use to target the correct ones each time.
I figure it would be possible to use RegEx to go through the webpage and find all URLs that have 'uge' or 'Uge' in them, and then open them. But is there a way to do that using BS? And if I do it with RegEx, what would a possible solution look like?
For example, here are two of the urls I want to grab in different webpages:
http://www.domstol.dk/KobenhavnsByret/retslister/Pages/Uge45-Tvangsauktioner.aspx
http://www.domstol.dk/esbjerg/retslister/Pages/Straffesageruge32.aspx
This should work... The RegEx uge\d\d? tells it to find "uge" followed by a digit, and possibly another one.
import re
for item in listofurls:
    l = re.findall("uge\d\d?", item, re.IGNORECASE)
    if l:
        print item  # just do whatever you want to do when it finds a match
Yes, you can do this with BeautifulSoup.
import re
from BeautifulSoup import BeautifulSoup
soup = BeautifulSoup(html_string)
# To find just 'Uge##' or 'uge##', as specified in the question:
urls = [el["href"] for el in soup.findAll("a", href=re.compile("[Uu]ge\d+"))]
# To find without regard to case at all:
urls = [el["href"] for el in soup.findAll("a", href=re.compile("(?i)uge\d+"))]
Or just use a simple for loop:
list_of_urls = ["""LIST GOES HERE"""]
for url in list_of_urls:
    if 'uge' in url.lower():
        # Code to execute
The regex expression would look something like: uge\d\d
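Putting those pieces together, a sketch of the whole loop could look like the following; the list of 24 start pages is a placeholder, and the requests/bs4 combination mirrors the style of other answers on this page rather than anything given in the question:
import re
import requests
from bs4 import BeautifulSoup

start_pages = ["""LIST OF 24 PAGES GOES HERE"""]  # placeholder, as in the loop above
week_link = re.compile(r"(?i)uge\d+")  # 'uge' or 'Uge' followed by the week number

for url in start_pages:
    soup = BeautifulSoup(requests.get(url).text, "html.parser")
    for a in soup.find_all("a", href=week_link):
        print(a["href"])  # e.g. .../Pages/Uge45-Tvangsauktioner.aspx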

Python strategy for extracting text from malformed html pages

I'm trying to extract text from arbitrary HTML pages. Some of the pages (which I have no control over) have malformed HTML or scripts which make this difficult. Also, I'm on a shared hosting environment, so while I can install any Python library, I can't just install anything I want on the server.
pyparsing and html2text.py also did not seem to work for malformed html pages.
Example URL is http://apnews.myway.com/article/20091015/D9BB7CGG1.html
My current implementation is approximately the following:
# Try using BeautifulSoup 3.0.7a
import BeautifulSoup
from BeautifulSoup import Comment

soup = BeautifulSoup.BeautifulSoup(s)
# strip comments, then script elements, then join the remaining text
comments = soup.findAll(text=lambda text: isinstance(text, Comment))
[comment.extract() for comment in comments]
c = soup.findAll('script')
for i in c:
    i.extract()
body = soup.body(text=True)
text = ''.join(body)
# if BeautifulSoup can't handle it,
# alter html by trying to find 1st instance of "<body" and replace everything prior to that, with "<html><head></head>"
# try beautifulsoup again with new html
If BeautifulSoup still does not work, then I resort to a heuristic: I look at the first and last characters of each line (to see whether it looks like a line of code, e.g. starting with # or < or ending with ;), take a sample of the line, and check whether the tokens are English words or numbers. If too few of the tokens are words or numbers, I guess that the line is code.
I could use machine learning to inspect each line, but that seems a little expensive and I would probably have to train it (since I don't know that much about unsupervised learning), and of course write it as well.
Any advice, tools, or strategies would be most welcome. Also, I realize the latter part of this is rather messy: if I get a line that is determined to contain code, I currently throw away the entire line, even if it contains some small amount of actual English text.
Try not to laugh, but:
from subprocess import Popen, PIPE

class TextFormatter:
    def __init__(self, lynx='/usr/bin/lynx'):
        self.lynx = lynx

    def html2text(self, unicode_html_source):
        "Expects unicode; returns unicode"
        return Popen([self.lynx,
                      '-assume-charset=UTF-8',
                      '-display-charset=UTF-8',
                      '-dump',
                      '-stdin'],
                     stdin=PIPE,
                     stdout=PIPE).communicate(input=unicode_html_source.encode('utf-8'))[0].decode('utf-8')
I hope you've got lynx!
Well, it depends how good the solution has to be. I had a similar problem, importing hundreds of old html pages into a new website. I basically did
from BeautifulSoup import BeautifulSoup
from html2text import html2text

# remove all that crap around the body and let BS fix the tags
newhtml = "<html><body>%s</body></html>" % (
    u''.join(unicode(tag) for tag in BeautifulSoup(oldhtml).body.contents))
# use html2text to turn it into text
text = html2text(newhtml)
and it worked out, but of course the documents could be so bad that even BS can't salvage much.
BeautifulSoup will do badly with malformed HTML. What about some regex-fu?
>>> import re
>>>
>>> html = """<p>This is paragraph with a bunch of lines
... from a news story.</p>"""
>>>
>>> pattern = re.compile('(?<=p>).+(?=</p)', re.DOTALL)
>>> pattern.search(html).group()
'This is paragraph with a bunch of lines\nfrom a news story.'
You can then assemble a list of valid tags from which you want to extract information.
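As a rough sketch of that last idea, with all the usual caveats about regexes and HTML; the tag list and file name here are purely illustrative:
import re

html_source = open('page.html').read()      # illustrative input file
valid_tags = ['p', 'h1', 'h2', 'h3', 'li']  # tags considered likely to hold real text

chunks = []
for tag in valid_tags:
    # non-greedy match of whatever sits between each opening and closing tag of interest
    pattern = re.compile(r'(?<=<%s>).+?(?=</%s>)' % (tag, tag), re.DOTALL)
    chunks.extend(pattern.findall(html_source))
text = '\n'.join(chunks)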
