Extract Text from HTML div using Python and lxml

I'm trying to get python to extract text from one spot of a website. I've identified the HTML div:
<div class="number">76</div>
which is in:
...div/div[1]/div/div[2]
I'm trying to use lxml to extract the '76' from that, but can't get a return out of it other than:
[]
Here's my code:
from lxml import html
import requests
url = 'https://sleepiq.sleepnumber.com/#/@@1'
values = {'username': 'my@gmail.com',
          'password': 'mypassword'}
page = requests.get(url, data=values)
tree = html.fromstring(page.content)
hr = tree.xpath('//div[@class="number"]/text()')
print hr
Any suggestions? I feel this should be pretty easy, thanks in advance!
Update: the element I want is not contained in the page.content returned by requests.get.
Update 2: It looks like this is not logging me in to the page that has the content I want; it is only getting the login screen content.

Have you tried printing your page.content to make sure your requests.get is retrieving the content you want? That is often where things break. And the empty list returned from your xpath search indicates "not found."
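For example, a quick sanity check before any parsing (a sketch):

print page.status_code
print page.content[:500]  # eyeball the first bytes: is this the page you expect, or a login screen?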
Assuming that's okay, your parsing is close. I just tried the following, which is successful:
from lxml import html
tree = html.fromstring('<body><div class="number">76</div></body>')
number = tree.xpath('//div[@class="number"]/text()')[0]
number now equals '76'. Note the [0] indexing, because xpath always returns a list of what's found. You have to dereference to find the content.
A common gotcha here is that the XPath text() function isn't as inclusive or straightforward as it might seem. If there are any sub-elements to the div, e.g. if the text is really <div class="number"><strong>76</strong></div>, then text() will return an empty list, because the text belongs to the strong, not the div. In real-world HTML, especially HTML that has ever been cut-and-pasted from a word processor or otherwise edited by humans, such extra elements are entirely common.
While it won't solve all text-handling issues, one handy workaround is to use the // multi-level descent instead of the / single-level step when selecting text():
number = ''.join(tree.xpath('//div[@class="number"]//text()'))
Now, regardless of whether there are sub-elements or not, the total text will be concatenated and returned.
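A minimal demonstration of the difference (the nested markup here is invented for illustration):

from lxml import html

tree = html.fromstring('<body><div class="number"><strong>76</strong></div></body>')
print(tree.xpath('//div[@class="number"]/text()'))   # [] because the text node belongs to <strong>
print(tree.xpath('//div[@class="number"]//text()'))  # ['76'] because // descends into sub-elements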
Update: OK, if your problem is logging in, you probably want to try a requests.post (rather than .get) at minimum. In simpler cases, just that change might work. In others, the login needs to be done on a separate page from the page you want to retrieve/scrape. In that case, you probably want to use a session object:
with requests.Session() as session:
    # First POST to the login page
    landing_page = session.post(login_url, data=values)

    # Now make authenticated requests within the same session
    page = session.get(url)
    # ...use page as above...
This is a bit more complex, but shows the logic for a separate login page. Many sites (e.g. WordPress sites) require this. Post-authentication, they often take you to pages (like the site home page) that aren't interesting content, though those pages can be scraped to confirm that the login succeeded. This altered login workflow doesn't change any of the parsing techniques, which work as above.
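Putting it together, a sketch of the login-then-scrape flow (login_url, url, and the 'Sign out' marker used to confirm success are hypothetical placeholders you'd verify against the real site):

import requests
from lxml import html

login_url = 'https://example.com/login'        # hypothetical placeholder
url = 'https://example.com/page-with-number'   # hypothetical placeholder
values = {'username': 'my@gmail.com', 'password': 'mypassword'}

with requests.Session() as session:
    landing_page = session.post(login_url, data=values)

    # Rough success check: logged-in pages typically contain a logout link.
    if landing_page.ok and 'Sign out' in landing_page.text:
        page = session.get(url)
        tree = html.fromstring(page.content)
        number = ''.join(tree.xpath('//div[@class="number"]//text()'))
        print(number)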

Beautiful Soup (http://www.pythonforbeginners.com/beautifulsoup/web-scraping-with-beautifulsoup) will help you out.
Another way:
http://docs.python-guide.org/en/latest/scenarios/scrape/
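For instance, a minimal Beautiful Soup sketch against the markup from the question:

from bs4 import BeautifulSoup

soup = BeautifulSoup('<div class="number">76</div>', 'html.parser')
number = soup.find('div', class_='number').get_text()
print(number)  # 76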

I'd use plain regex over xml tools in this case. It's easier to handle.
import re
import requests

url = 'http://sleepiq.sleepnumber.com/#/user/-9223372029758346943@@2'
values = {'email-email': 'my@gmail.com', 'password-clear': 'Combination',
          'password-password': 'mypassword'}
page = requests.get(url, data=values, timeout=5)

m = re.search(r'(\w*)(<div class="number">)(.*)(</div>)', page.text)
# m = re.search(r'(\w*)(<title>)(.*)(</title>)', page.text)
if m:
    print(m.group(3))
else:
    print('Not found')

Related

Extract Tag from XML

I'm very new to Python and am attempting my first web scraping project. I'm attempting to extract the data following a tag within an XML data source. I've attached an image of the data I'm working with. My issue is that, no matter what tag I try to extract, I constantly return no results. I am able to return the entire data source, so I know the connection is not the issue.
My ultimate goal is to loop through all of the data and return the data following a particular tag. I think if I can understand why I'm unable to print a singular particular tag I should be able to figure out how to loop through all of the data. I've looked through similar posts but I think the tree in my set of data is particularly troublesome (that and my inexperience).
My Code:
from bs4 import BeautifulSoup
import requests
#Assign URL to scrape
URL = "http://api.powertochoose.org/api/PowerToChoose/plans?zip_code=78364"
#Fetch the raw HTML Data
Data = requests.get(URL)
Soup = BeautifulSoup(Data.text, "html.parser")
tags = Soup.find_all('fact_sheet')
print (tags)
Check the response of your example first: it is JSON, not XML, so no BeautifulSoup is needed here. Simply iterate the data list to pick out your fact_sheets:
for plan in Data.json()['data']:
    print(plan['fact_sheet'])
Out:
https://rates.cleanskyenergy.com:8443/rates/DownloadDoc?path=a70e9298-5537-481a-985c-c7a005b2e4f3.html&id_plan=223344
https://texpo-prod-api.eroms.works/api/v1/document/ViewProductDocument?type=efl&rateCode=SRCPLF24PTC&lang=en
https://www.txu.com/Handlers/PDFGenerator.ashx?comProdId=TCXSIMVL1212AR&lang=en&formType=EnergyFactsLabel&custClass=3&tdsp=AEPTCC
https://signup.myvaluepower.com/Home/EFL?productId=32653&Promo=16410
https://docs.cloud.flagshippower.com/EFL?term=36&duns=007924772&product=galleon&lang=en&code=FPSPTC2
...
As you've already realized by now, you're getting the data as json, so doing something like:
fact_sheet_links = [d['fact_sheet'] for d in Data.json()['data']]
would get you the data you want.
But also, if you'd prefer to work with the xml, you can add headers to the request:
Data = requests.get(URL, headers={ 'Accept': 'application/xml' })
and get an XML response. When I did this, Soup.find_all('fact_sheet') still did not work (although I've seen this method used in some tutorials, so it might be a version problem, and it might still work for you), but it did work when I used find_all with a lambda:
tags = Soup.find_all(lambda t: 'fact_sheet' in t.name)
With that change, the fact_sheet tags are matched. That just gives you the tags though, so if you want a list of the contents instead, one way would be to use a list comprehension:
fact_sheet_links = [t.text for t in tags]
so that you get them as a plain list of URL strings.

xpath how to format path

I would like to get the @src value '/pol_il_DECK-SANTA-CRUZ-STAR-WARS-EMPIRE-STRIKES-BACK-POSTER-8-25-20135.jpg' from a webpage:
from lxml import html
import requests
URL = 'http://systemsklep.pl/pol_m_Kategorie_Deskorolka_Deski-281.html'
session = requests.session()
page = session.get(URL)
HTMLn = html.fromstring(page.content)
print HTMLn.xpath('//html/body/div[1]/div/div/div[3]/div[19]/div/a[2]/div/div/img/@src')[0]
but I can't. No matter how I format the XPath, it doesn't work.
In the spirit of @pmuntima's answer, if you already know it's the 14th sourced image, but want to stay with lxml, then you can:
print HTMLn.xpath('//img/@data-src')[14]
To get that particular image. It similarly reports:
/pol_il_DECK-SANTA-CRUZ-STAR-WARS-EMPIRE-STRIKES-BACK-POSTER-8-25-20135.jpg
If you want to do your indexing in XPath (possibly more efficient in very large result sets), note that XPath positions are 1-based, unlike Python's, so the equivalent is:
print HTMLn.xpath('(//img/@data-src)[15]')[0]
It's a little bit uglier, given the need to parenthesize in the XPath, and then to index out the first element of the list that .xpath always returns.
Still, as discussed in the comments above, strictly numerical indexing is generally a fragile scraping pattern.
Update: So why is the XPath given by browser inspect tools not leading to the right element? Because the content seen by a browser, after a dynamic JavaScript-based update process, is different from the content seen by your request. Your request is not running JS and is doing no such updates. Different content means a different address is needed, at least if the address is static and fragile.
Part of the updates here seem to take src URIs, which initially point to an "I'm loading!" gif, and replace them with the "real" src values, which are found in the data-src attribute to begin with.
So you need two changes:
a stronger way to address the content you want (a way that doesn't break when you move from browser inspect to program fetch) and
to fetch the URIs you want from data-src not src, because in your program fetch, the JS has not done its load-and-switch trick the way it did in the browser.
If you know text associated with the target image, that can be the trick. E.g.:
search_phrase = 'DECK SANTA CRUZ STAR WARS EMPIRE STRIKES BACK POSTER'
path = '//img[contains(@alt, "{}")]/@data-src'.format(search_phrase)
print HTMLn.xpath(path)[0]
This works because the alt attribute contains the target text. You look for images that have the search phrase contained in their alt attributes, then fetch the corresponding data-src values.
I used a combination of the requests and Beautiful Soup libraries. They are both wonderful and I would recommend them for scraping and parsing/extracting HTML. If you have a complex scraping job, Scrapy is really good.
So for your specific example, I can do
from bs4 import BeautifulSoup
import requests
URL = 'http://systemsklep.pl/pol_m_Kategorie_Deskorolka_Deski-281.html'
r = requests.get(URL)
soup = BeautifulSoup(r.text, "html.parser")
specific_element = soup.find_all('a', class_="product-icon")[14]
res = specific_element.find('img')["data-src"]
print(res)
It will print out
/pol_il_DECK-SANTA-CRUZ-STAR-WARS-EMPIRE-STRIKES-BACK-POSTER-8-25-20135.jpg

How can I use urllib2, regex, and sys to read a specific URL and then return my search and the results?

I am in need of some help for an assignment. I need to build a "simple" (according to my teacher) web-scraper that takes a URL as an argument, searches the source code of that URL, and then returns the links within that source code back (anything after href). The example URL my teacher has been having us use is http://citstudent.lanecc.net/tools.shtml. When the program is executed, there should be ten links returned as well as the URL of the website.
Since I am still trying to wrap my head around these concepts, I wasn't sure where to start so I turned to stack overflow and I found a script that -kind of- works. It does what I want it to do, but does not fulfill every requirement:
import urllib2
url = "http://citstudent.lanecc.net/tools.shtml"
page = urllib2.urlopen(url)
data = page.read().split("</a>")
tag = "<a href=\""
endtag = "\">"
for item in data:
    if "<a href" in item:
        try:
            ind = item.index(tag)
            item = item[ind+len(tag):]
            end = item.index(endtag)
        except: pass
        else:
            print item[:end]
This works because I hard-coded the URL into my code, and because it prints after some href tags. Normally I would say to just guide me through this and to not just give me the code, but I'm having such a shit day and any explanation or example of this has to be better than what we went over in class. Thank you.

How to replace local paths with global paths in href attributes when scraping in Python

I'm trying to scrape some HTML code from this site. When I print all the content, some links (I want only "Table of Contents" and "Printer-friendly Version") have a relative path like "../etc" inside their href.
When I print the scraped code, I need to replace the local (relative) path in each href with the global (absolute) one, so that clicking a scraped link reaches the right webpage. If that operation isn't feasible, is there a way to write the right path into the hrefs I need to handle?
#!C:/Python27/python
from lxml import etree
import requests
q = "http://www.dlib.org/dlib/november14/giannakopoulos/11giannakopoulos.html"
page = requests.get(q)
tree = etree.HTML(page.text)
element = tree.xpath('./body/form/table[3]/tr/td/table[5]')
content = etree.tostring(element[0])
print "Content-type: text\n\n"
print content.strip()
In the resulting HTML, just add an xml:base to the root of the document that's the same as the URL of the original page, or any other base URI where you have the same resources.
XPath cannot change the document for you; you will need to do that on the document itself. Etree has a property to set the base URI, but I don't know if it will output that when you print the results. The most obvious way to replace this is to use XSLT, which is also supported by lxml.
Setting the base URI will have effect on elements that respect the base URI. Older browsers may not properly respect it, in which case you can use <base>.
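Alternatively, if rewriting the links in the scraped fragment is acceptable, lxml.html has a make_links_absolute method that resolves relative hrefs against a base URL. A sketch (note it uses lxml.html rather than etree, and it rewrites the links instead of setting a base URI):

from lxml import html
import requests

q = "http://www.dlib.org/dlib/november14/giannakopoulos/11giannakopoulos.html"
page = requests.get(q)

doc = html.fromstring(page.text)
doc.make_links_absolute(q)  # href="../etc" becomes a full http://www.dlib.org/... URL

element = doc.xpath('./body/form/table[3]/tr/td/table[5]')
print html.tostring(element[0]).strip()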

scraping a constantly changing integer from a website

I am trying to extract numeric data from a website. I tried using a simple web scraper to retrieve the data:
from mechanize import Browser
from bs4 import BeautifulSoup
mech = Browser()
url = "http://www.oanda.com/currency/live-exchange-rates/"
page = mech.open(url)
html = page.read()
soup = BeautifulSoup(html)
data1 = soup.find(id='EUR_USD-b-int')
print data1
This kind of approach normally would give the line of data from the website, including the contents of the element I am trying to extract. However, it gives everything but the contents, which is the part I need. I have tried .contents and it returns []. I've also tried .child and it returns None. Does anyone know another method that could work? I have looked through the Beautiful Soup documentation but I can't seem to find a solution.
The value on this page is updated using JavaScript by making a request to
GET http://www.oanda.com/lfr/rates_lrrr?tstamp=1392757175089&lrrr_inverts=1
Referer: http://www.oanda.com/currency/live-exchange-rates/
(Be aware that I was blocked 4 times just looking at this, they are extremely block-happy. This is because they sell this data commercially as a subscription service.)
The request is made and the response parsed in http://www.oanda.com/jslib/wl/lrrr/liverates.js. The response is "encrypted" with RC4 (http://en.wikipedia.org/wiki/RC4)
The RC4 decrypt method is coming from http://www.oanda.com/wandacache/rc4-ea63ca8c97e3cbcd75f72603d4e99df48eb46f66.js. It looks like this file is refreshed often so you'll need to grab the latest link from the homepage and extract the var key=<value> to fully decrypt the value.
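For reference, a sketch of what that client-side decryption step looks like reimplemented in Python: standard RC4 (KSA + PRGA). The key below is a hypothetical placeholder, since the real one has to be pulled out of the current rc4-*.js file, and the exact encoding of the response body is an assumption you'd need to verify.

import requests

def rc4(key, data):
    # Key-scheduling algorithm (KSA)
    S = list(range(256))
    j = 0
    for i in range(256):
        j = (j + S[i] + ord(key[i % len(key)])) % 256
        S[i], S[j] = S[j], S[i]
    # Pseudo-random generation algorithm (PRGA): XOR the keystream with the data
    i = j = 0
    out = []
    for ch in data:
        i = (i + 1) % 256
        j = (j + S[i]) % 256
        S[i], S[j] = S[j], S[i]
        out.append(chr(ord(ch) ^ S[(S[i] + S[j]) % 256]))
    return ''.join(out)

key = 'deadbeef'  # hypothetical placeholder: scrape the real var key=<value> from the rc4-*.js file
resp = requests.get(
    'http://www.oanda.com/lfr/rates_lrrr',
    params={'tstamp': '1392757175089', 'lrrr_inverts': '1'},
    headers={'Referer': 'http://www.oanda.com/currency/live-exchange-rates/'},
)
print rc4(key, resp.text)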
