xpath how to format path - python

I would like to get the @src value '/pol_il_DECK-SANTA-CRUZ-STAR-WARS-EMPIRE-STRIKES-BACK-POSTER-8-25-20135.jpg' from this webpage:
from lxml import html
import requests
URL = 'http://systemsklep.pl/pol_m_Kategorie_Deskorolka_Deski-281.html'
session = requests.session()
page = session.get(URL)
HTMLn = html.fromstring(page.content)
print(HTMLn.xpath('//html/body/div[1]/div/div/div[3]/div[19]/div/a[2]/div/div/img/@src')[0])
but I can't. No matter how I format the XPath, it doesn't work.

In the spirit of @pmuntima's answer, if you already know it's the 14th sourced image, but want to stay with lxml, then you can:
print(HTMLn.xpath('//img/@data-src')[14])
To get that particular image. It similarly reports:
/pol_il_DECK-SANTA-CRUZ-STAR-WARS-EMPIRE-STRIKES-BACK-POSTER-8-25-20135.jpg
If you want to do your indexing in XPath (possibly more efficient in very large result sets), then:
print(HTMLn.xpath('(//img/@data-src)[15]')[0])
It's a little uglier, given the need to parenthesize the XPath and then to index out the first element of the list that .xpath always returns (note that XPath positions are 1-based, so [15] here corresponds to Python's [14]).
Still, as discussed in the comments above, strictly numerical indexing is generally a fragile scraping pattern.
Update: So why is the XPath given by browser inspect tools not leading to the right element? Because the content a browser sees, after a dynamic JavaScript-based update process, is different from the content your request sees. Your request is not running JS and does no such updates. Different content, different address needed, at least if the address is static and fragile.
Part of the update here seems to be taking src URIs that initially point to an "I'm loading!" GIF and replacing them with the "real" src values, which start out in the data-src attribute.
So you need two changes:
a stronger way to address the content you want (one that doesn't break when you move from browser inspect to program fetch), and
to fetch the URIs you want from data-src rather than src, because in your program fetch the JS has not done its load-and-switch trick the way it did in the browser.
If you know text associated with the target image, that can be the trick. E.g.:
search_phrase = 'DECK SANTA CRUZ STAR WARS EMPIRE STRIKES BACK POSTER'
path = '//img[contains(@alt, "{}")]/@data-src'.format(search_phrase)
print(HTMLn.xpath(path)[0])
This works because the alt attribute contains the target text. You look for images that have the search phrase contained in their alt attributes, then fetch the corresponding data-src values.

I used a combination of the requests and Beautiful Soup libraries. They are both wonderful and I would recommend them for scraping and parsing/extracting HTML. If you have a complex scraping job, Scrapy is really good.
So for your specific example, I can do
from bs4 import BeautifulSoup
import requests
URL = 'http://systemsklep.pl/pol_m_Kategorie_Deskorolka_Deski-281.html'
r = requests.get(URL)
soup = BeautifulSoup(r.text, "html.parser")
specific_element = soup.find_all('a', class_="product-icon")[14]
res = specific_element.find('img')["data-src"]
print(res)
It will print out
/pol_il_DECK-SANTA-CRUZ-STAR-WARS-EMPIRE-STRIKES-BACK-POSTER-8-25-20135.jpg
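As a side note, if the positional index [14] ever proves fragile (per the discussion in the first answer), here is a hedged sketch of the same lookup keyed on the image's alt text instead; the search phrase is an assumption carried over from that answer:
search_phrase = 'DECK SANTA CRUZ STAR WARS EMPIRE STRIKES BACK POSTER'
# Find the first <img> whose alt attribute contains the phrase, then read its data-src
img = soup.find('img', alt=lambda a: a and search_phrase in a)
if img is not None:
    print(img["data-src"])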

Related

Select css tags with randomized letters at the end

I am currently learning web scraping with Python. I'm reading Web Scraping with Python by Ryan Mitchell.
I am stuck at Crawling Sites Through Search. For example, the Reuters search given in the book works perfectly, but when I try to find it by myself, as I will do in the future, I get this link.
While the second link works for a human, I cannot figure out how to scrape it because of weird class names like class="media-story-card__body__3tRWy".
The first link gives me simple names, like class="search-result-content", that I can scrape.
I've encountered the same problem on other sites too. How would I go about scraping it or finding a link with normal names in the future?
Here's my code example:
from bs4 import BeautifulSoup
import requests
from rich.pretty import pprint
text = "hello"
url = f"https://www.reuters.com/site-search/?query={text}"
response = requests.get(url)
soup = BeautifulSoup(response.text, "lxml")
results = soup.select("div.media-story-card__body__3tRWy")
for result in results:
    pprint(result)
    pprint("###############")
You might resort to a prefix attribute value selector, like
div[class^="media-story-card__body__"]
This assumes that the class is the only one (or at least notationally the first). However, the idea can be extended to checking for a substring.
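Applied to the code above, a minimal sketch (reusing the soup object from the question) would just swap the selector for the prefix version:
# Prefix match: any div whose class attribute starts with the stable part of the name
results = soup.select('div[class^="media-story-card__body__"]')
for result in results:
    pprint(result)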

Unable to find tag when data scraping

I am new to Python and I've been working on a program that alerts you when a new item is uploaded to jp.mercari.com (a shopping site). I have the alert part of the program working, but it operates based on the number of items that come up in the search results. When I scrape the website, I am unable to find what I am looking for, despite being able to locate it when I inspect element on the page. The scraping program looks like this:
from bs4 import BeautifulSoup
import requests
url = "https://jp.mercari.com/search?keyword=pachinko"
result = requests.get(url)
doc = BeautifulSoup(result.text, "html.parser")
tag = doc.find_all("mer-text")
print(tag)
For more context, this is the website and some of the HTML; I've circled the parts I am trying to find in red.
Does anyone know why I am unable to find what I'm looking for?
Here is another example of the same problem but from a website that is in English:
import requests
url = "https://www.vinted.co.uk/vetements?search_text=pachinko"
result = requests.get(url)
doc = BeautifulSoup(result.text, "html.parser")
tag = doc.find_all("span")
print(tag)
Again, I can see the part of the HTML I want when I inspect element, but I can't find it when I scrape the website.
Here's what's happening for me: the element you seek (<mer-text>) is being found. However, the output is in Japanese, and Python doesn't know what to do with that. In my browser it's being translated to English automatically by Google, so that's easier to deal with.
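If the problem is only that your console can't encode the Japanese text, one minimal workaround (a sketch, assuming Python 3.7+ and that the tags really are in the fetched HTML) is to force UTF-8 output:
import sys

# Make stdout UTF-8 so Japanese text prints instead of raising UnicodeEncodeError
sys.stdout.reconfigure(encoding="utf-8")
for t in tag:
    print(t.get_text())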

Python (Selenium/BeautifulSoup) Search Result Dynamic URL

Disclaimer: This is my first foray into web scraping
I have a list of URLs corresponding to search results, e.g.,
http://www.vinelink.com/vinelink/servlet/SubjectSearch?siteID=34003&agency=33&offenderID=2662
I'm trying to use Selenium to access the HTML of the result as follows:
for url in detail_urls:
    driver.get(url)
    html = driver.page_source
    soup = BeautifulSoup(html, 'html.parser')
    print(soup.prettify())
However, when I comb through the resulting prettified soup, I notice that the components I need are missing. Upon looking back at the page loading process, I see that the URL redirects a few times as follows:
http://www.vinelink.com/vinelink/servlet/SubjectSearch?siteID=34003&agency=33&offenderID=2662
https://www.vinelink.com/#/searchResults/id/offender/34003/33/2662
https://www.vinelink.com/#/searchResults/1
Does anyone have a tip on how to access the final search results data?
Update: After further exploration this seems like it might have to do with the scripts being executed to retrieve the relevant data for display... there are many search results-related scripts referenced in the page_source; is there a way to determine which is relevant?
I am able to inspect the information I need, per this image.
Once you have your soup variable with the HTML, follow the code below.
import json
data = soup.find('search-result')['data']
print(data)
Output:
{"offender_sid":154070373,"siteId":34003,"siteDesc":"NC_STATE","first_name":"WESLEY","last_name":"ADAMS","middle_initial":"CHURCHILL","alias_first_name":null,"alias_last_name":null,"alias_middle_initial":null,"oid":"2662","date_of_birth":"1965-11-21","agencyDesc":"Durham County Detention Center","age":53,"race":2,"raceDesc":"African American","gender":null,"genderDesc":null,"status_detail":"Durham County Detention Center","agency":33,"custody_status_cd":1,"custody_detail_cd":33,"custody_status_description":"In Custody","aliasFlag":false,"registerValid":true,"detailAgLink":false,"linkedCases":false,"registerMessage":"","juvenile_flg":0,"vineLinkInd":1,"vineLinkAgAccessCd":2,"links":[{"rel":"agency","href":"//www.vinelink.com/VineAppWebService/api/site/agency/34003/33"},{"rel":"self","href":"//www.vinelink.com/VineAppWebService/api/offender/?offSid=154070373&lang=en_US"}],"actions":[{"name":"register","template":"//www.vinelink.com/VineAppWebService/api/register/{json data}","method":"POST"}]}
Now treat each value like a dict.
Next:
info = json.loads(data)
print(info['first_name'], info['last_name'])
# This prints the first and last name, but you can get others; just use the key, like 'date_of_birth' or 'siteId'. You can also assign them to variables.
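For instance, based on the JSON shown above, other values work the same way once json.loads has run (a sketch using keys taken from that output):
# Simple keys
print(info['date_of_birth'], info['siteId'])
# Nested structures are plain lists/dicts after json.loads
print(info['links'][0]['href'])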

When using Safari's Develop -> Show Page Source, what's the difference between Elements and Resources?

When using Safari's develop tool to obtain the webpage's source code, there seem to be two columns: one is Elements, the other is Resources. For some webpages they are different, while for others they are the same. And there is only Elements in Chrome.
In Safari's Elements I can easily locate the elements I want, but in Resources I can't.
This surely matters because I want to scrape a webpage whose Elements and Resources are different. And when I code like this:
from bs4 import BeautifulSoup
import requests
params = {"TPL_username": "xxxx", "TPL_password": "1xxx"}
tb = requests.post("https://login.xxxx.com/",data=params)
r = requests.get('http://www.xxxx.com',cookies=tb.cookies)
content = r.content
print(content)
soup = BeautifulSoup(content, "html.parser")
timelists = soup.find_all("span", {"class": "bought-wrapper-mod__create-time___yNWVS"})
print(timelists)
the printed output (content) is the Resources version rather than the Elements version, which is what I want. So I can't locate the element via Resources, and it is obvious that the code I see in Resources is not the real source code of the webpage. So why is that? Doesn't requests.get return the real source code of the webpage?

Extract Text from HTML div using Python and lxml

I'm trying to get python to extract text from one spot of a website. I've identified the HTML div:
<div class="number">76</div>
which is in:
...div/div[1]/div/div[2]
I'm trying to use lxml to extract the '76' from that, but can't get a return out of it other than:
[]
Here's my code:
from lxml import html
import requests
url = 'https://sleepiq.sleepnumber.com/#/##1'
values = {'username': 'my@gmail.com',
          'password': 'mypassword'}
page = requests.get(url, data=values)
tree = html.fromstring(page.content)
hr = tree.xpath('//div[@class="number"]/text()')
print(hr)
Any suggestions? I feel this should be pretty easy, thanks in advance!
Update: the element I want is not contained in the page.content from requests.get
Updated Update: It looks like this is not logging me in to the page where the content I want is. It is only getting the login screen content.
Have you tried printing page.content to make sure requests.get is retrieving the content you want? That is often where things break. And the empty list returned from your XPath search indicates "not found."
Assuming that's okay, your parsing is close. I just tried the following, which is successful:
from lxml import html
tree = html.fromstring('<body><div class="number">76</div></body>')
number = tree.xpath('//div[@class="number"]/text()')[0]
number now equals '76'. Note the [0] indexing, because xpath always returns a list of what's found. You have to dereference to find the content.
A common gotcha here is that the XPath text() function isn't as inclusive or straightforward as it might seem. If there are any sub-elements to the div--e.g. if the text is really <div class="number"><strong>76</strong></div> then text() will return an empty list, because the text belongs to the strong not the div. In real-world HTML--especially HTML that's ever been cut-and-pasted from a word processor, or otherwise edited by humans--such extra elements are entirely common.
While it won't solve all known text management issues, one handy workaround is to use the // multi-level indirection instead of the / single-level indirection to text:
number = ''.join(tree.xpath('//div[@class="number"]//text()'))
Now, regardless of whether there are sub-elements or not, the total text will be concatenated and returned.
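As a quick check of that gotcha (a small sketch using the same lxml API as above):
from lxml import html

tree = html.fromstring('<body><div class="number"><strong>76</strong></div></body>')
# Direct text() finds nothing, because the text belongs to <strong>, not the div:
print(tree.xpath('//div[@class="number"]/text()'))    # []
# The // form gathers text from all descendants:
print(''.join(tree.xpath('//div[@class="number"]//text()')))    # 76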
Update: OK, if your problem is logging in, you probably want to try a requests.post (rather than .get) at minimum. In simpler cases, just that change might work. In others, the login needs to be done on a separate page from the page you want to retrieve/scrape. In that case, you probably want to use a session object:
with requests.Session() as session:
    # First POST to the login page
    landing_page = session.post(login_url, data=values)
    # Now make the authenticated request within the session
    page = session.get(url)
    # ...use page as above...
This is a bit more complex, but it shows the logic for a separate login page. Many sites (e.g. WordPress sites) require this. Post-authentication, they often take you to pages (like the site home page) that aren't interesting content in themselves, though they can be scraped to check whether the login succeeded. This altered login workflow doesn't change any of the parsing techniques, which work as above.
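For example, one way to sanity-check the login before scraping (a sketch; 'Log out' is a placeholder for whatever marker text the site actually shows only to logged-in users):
# Crude check: look for text that only appears when authenticated
if 'Log out' not in landing_page.text:
    raise RuntimeError('Login appears to have failed; check credentials and form field names')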
Beautiful Soup (http://www.pythonforbeginners.com/beautifulsoup/web-scraping-with-beautifulsoup) will help you out.
Another way:
http://docs.python-guide.org/en/latest/scenarios/scrape/
I'd use plain regex over xml tools in this case. It's easier to handle.
import re
import requests
url = 'http://sleepiq.sleepnumber.com/#/user/-9223372029758346943##2'
values = {'email-email': 'my@gmail.com', 'password-clear': 'Combination',
          'password-password': 'mypassword'}
page = requests.get(url, data=values, timeout=5)
m = re.search(r'(\w*)(<div class="number">)(.*)(<\/div>)', page.text)
# m = re.search(r'(\w*)(<title>)(.*)(<\/title>)', page.text)
if m:
    print(m.group(3))
else:
    print('Not found')
