I am trying to scrape a site (you will find its link in the code below).
The goal is to get the data from within the attributes, since the elements contain no text when I inspect the code.
Here is the full XPath of an element:
/html/body/div[2]/div[3]/div/div[3]/section[1]/div/div[2]/div[1]
and the code:
import requests
from lxml import html
page = requests.get('https://www.meilleursagents.com/annonces/achat/nice-06000/appartement/')
tree = html.fromstring(page.content)
Trying to scrape the value of the 'data-wa-data' attribute with:
tree.xpath('/html/body/div[2]/div[3]/div/div[3]/section[1]/div/div[2]/div[1]/@data-wa-data')
yields an empty list.
The same issue occurs for another element that does have text:
tree.xpath('/html/body/div[2]/div[3]/div/div[3]/section[1]/div/div[2]/div[1]/div/a/div[1]/text()')
The problem is that this website requires a User-Agent header to serve the complete HTML, and your request doesn't send one. So, to get the complete page, pass a user-agent as a header.
Note: this website is aggressive when it comes to blocking; you cannot even make two consecutive requests with the same user-agent. My advice would be to rotate proxies and user-agents, and also to add a download delay between requests to avoid hitting the server too rapidly (a sketch of that follows).
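A minimal sketch of that rotation idea, before the actual fix below (the user-agent pool is illustrative, and fetch is just a hypothetical helper):

import random
import time

import requests
from lxml import html

# Illustrative pool; in practice use a longer list, plus a proxy list
USER_AGENTS = [
    'Mozilla/5.0 (Windows NT 6.1; WOW64; rv:38.0) Gecko/20100101 Firefox/38.0',
    'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:79.0) Gecko/20100101 Firefox/79.0',
]

def fetch(url):
    # Pick a different user-agent on every request
    headers = {'user-agent': random.choice(USER_AGENTS)}
    page = requests.get(url, headers=headers)
    time.sleep(random.uniform(2, 5))  # download delay between requests
    return html.fromstring(page.content)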
Code
import requests
from lxml import html
headers = {
'user-agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64; rv:38.0) Gecko/20100101 Firefox/38.0'
}
page = requests.get('https://www.meilleursagents.com/annonces/achat/nice-06000/appartement/', headers=headers)
tree = html.fromstring(page.content)
print(tree.xpath('//div[@class="listing-item search-listing-result__item"]/@data-wa-data'))
Output
['listing_id=1971029217|realtor_id=21407|source=listings_results', 'listing_id=1971046117|realtor_id=74051|source=listings_results', 'listing_id=1971051280|realtor_id=71648|source=listings_results', 'listing_id=1971053639|realtor_id=21407|source=listings_results', 'listing_id=1971053645|realtor_id=38087|source=listings_results', 'listing_id=1971053650|realtor_id=29634|source=listings_results', 'listing_id=1971053651|realtor_id=29634|source=listings_results', 'listing_id=1971053652|realtor_id=29634|source=listings_results', 'listing_id=1971053656|realtor_id=39588|source=listings_results', 'listing_id=1971053658|realtor_id=39588|source=listings_results', 'listing_id=1971053661|realtor_id=39588|source=listings_results', 'listing_id=1971053662|realtor_id=39588|source=listings_results']
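Each value is a pipe-delimited list of key=value pairs, so it can be unpacked into dicts with a couple of splits (a small sketch over the output above):

values = tree.xpath('//div[@class="listing-item search-listing-result__item"]/@data-wa-data')
# 'listing_id=...|realtor_id=...|source=...' -> {'listing_id': '...', ...}
listings = [dict(pair.split('=', 1) for pair in value.split('|')) for value in values]
print(listings[0]['listing_id'])  # e.g. '1971029217'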
Related
While scraping the following website (https://www.middletownk12.org/Page/4113), this code could not locate the table rows (to get the staff name, email & department), even though they are visible when I use the Chrome developer tools. The soup object is not readable enough to locate the tr tags that have the info needed.
import requests
from bs4 import BeautifulSoup
url = "https://www.middletownk12.org/Page/4113"
headers = {
"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.3"
}
response = requests.get(url, headers=headers)
soup = BeautifulSoup(response.text, "html.parser")
print(response.text)
I used different libraries such as bs4, requests & selenium with no luck. I also tried CSS selectors & XPath with selenium, with no luck. The tr elements could not be located.
That table of contact information is filled in by Javascript after the page has loaded. The content doesn't exist in the page's HTML and you won't see it using requests.
By using the developer tools available in the browser, we can examine the requests made after the page has loaded. There are a lot of them, but at least in my browser it's obvious the contact information is loaded near the end.
Looking at the request log, I see a request for a spreadsheet from docs.google.com. If we examine that entry, we find that it's a request for:
https://docs.google.com/spreadsheets/d/e/2PACX-1vSPXpr9MjxZXaYteex9ZMydfXx81YWqf5Coh9TfcB0q8YNRWrYTAtypX3IPlW44ZzXmhaSiQGNY-yle/pubhtml/sheet?headers=false&gid=0
And if we fetch the above link, we get a spreadsheet with the source data for that table.
Actually I used Selenium & then bs4 without any results. The code does not find the 'tr' elements...
Why are you using Selenium? The whole point of this answer is that you don't need to use Selenium if you can figure out the link to retrieve the data -- which we have.
All we need is requests to fetch the data and BeautifulSoup to parse it:
import requests
import bs4
url = 'https://docs.google.com/spreadsheets/d/e/2PACX-1vSPXpr9MjxZXaYteex9ZMydfXx81YWqf5Coh9TfcB0q8YNRWrYTAtypX3IPlW44ZzXmhaSiQGNY-yle/pubhtml/sheet?headers=false&gid=0'
res = requests.get(url)
res.raise_for_status()
soup = bs4.BeautifulSoup(res.text, 'html.parser')
for link in soup.find_all('a'):
    print(f"{link.text}: {link.get('href')}")
I'm trying to scrape a Quizlet match set with Python. I want to scrape all the <span> tags with class TermText.
Here's the URL: 'https://quizlet.com/291523268'
import requests
URL = 'https://quizlet.com/291523268'
raw = requests.get(URL).text
raw ends up containing none of the tags or cards at all. When I check the source of the website, it shows all the TermText spans that I need, meaning the content is not JS-loaded. So I don't understand why my HTML is coming out wrong, since it doesn't contain any of the HTML I need.
To get the correct response from the server, set a correct User-Agent HTTP header:
import requests
from bs4 import BeautifulSoup
url = 'https://quizlet.com/291523268/python-flash-cards/'
headers = {'User-Agent': 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:79.0) Gecko/20100101 Firefox/79.0'}
soup = BeautifulSoup(requests.get(url, headers=headers).content, 'html.parser')
for span in soup.select('span.TermText'):
    print(span.get_text(strip=True))
Prints:
algorithm
A set of specific steps for solving a category of problems
token
basic elements of a language(letters, numbers, symbols)
high-level language
A programming language like Python that is designed to be easy for humans to read and write.
low-level langauge
...and so on.
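Judging by the output, the TermText spans alternate between term and definition, so pairing them up is a matter of zipping even and odd entries (a sketch, assuming that alternation holds for the whole set):

terms = [span.get_text(strip=True) for span in soup.select('span.TermText')]
# Even indexes are terms, odd indexes are their definitions
for term, definition in zip(terms[::2], terms[1::2]):
    print(f'{term}: {definition}')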
I'm trying to make a bot that sends me an email once a new product appears on a website.
I tried to do that with requests and BeautifulSoup.
This is my code:
import requests
from bs4 import BeautifulSoup
URL = 'https://www.vinted.fr/vetements?search_text=football&size_id[]=207&price_from=0&price_to=15&order=newest_first'
headers = {'User-Agent': "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/84.0.4147.89 Safari/537.36"}
page = requests.get(URL, headers=headers)
soup = BeautifulSoup(page.content, 'html.parser')
products = soup.find_all("div", class_="c-box")
print(len(products))
Next, I'll want to compare the number of products before and after each new request in a loop.
But when I try to see the number of products found, I get an empty list: []
I don't know how to fix that.
The div I'm looking for is nested inside other divs; I don't know if that's related.
Thanks in advance.
The problem is with the website that you are trying to parse.
The website in your code generates the elements you are looking for (div.c-box) after the page has fully loaded, using JavaScript on the client side. So it's like:
Browser gets HTML source from server --(1)--> JS files load as the browser parses the HTML source --> JS files add elements to the HTML --(2)--> those elements appear in the browser
You cannot fetch the data you want with requests.get, because requests.get can only retrieve the HTML source as it exists at point (1), while the website loads the data at point (2). To fetch such data, you should use a browser-automation module such as selenium.
You should always check the data you actually received.
Convert your BeautifulSoup object to a string with str(soup) and write it to a file. Then check what you really got from the website. In this case, there is no element with the c-box class.
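For example (the filename is arbitrary):

# Save what requests actually received, then inspect the file in an editor
with open('vinted_dump.html', 'w', encoding='utf-8') as f:
    f.write(str(soup))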
You should use selenium instead of requests.
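A minimal selenium sketch for this page (the c-box selector comes from the question; chromedriver must be on your PATH):

from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait

URL = ('https://www.vinted.fr/vetements?search_text=football'
       '&size_id[]=207&price_from=0&price_to=15&order=newest_first')

driver = webdriver.Chrome()
driver.get(URL)
# Wait until the client-side JS has rendered at least one listing
WebDriverWait(driver, 10).until(
    EC.presence_of_element_located((By.CSS_SELECTOR, 'div.c-box')))

soup = BeautifulSoup(driver.page_source, 'html.parser')
driver.quit()

print(len(soup.find_all('div', class_='c-box')))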
I was trying to get links from this website, but I noticed that the links I get from parsing are different from the ones showing in my browser. There aren't any missing links as such, because both the browser and the parsed results show 14 hyperlinks (for series).
But my browser shows some links my result doesn't have, and my result shows some links my browser doesn't have.
For example, my results show a link like
"https://4anime.to/anime/one-piece-nenmatsu-tokubetsu-kikaku-mugiwara-no-luffy-oyabun-torimonochou"
but when I searched for the word "torimonochou" in the browser, I could not find any match.
I searched for the link in the page source (right-clicked the page and selected "view page source"), so I should not be missing anything. I also passed my browser's headers to requests.get(), so I should be getting the same HTML code.
The code:
import requests
import bs4

head = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:79.0) Gecko/20100101 Firefox/79.0'}
searchResObj = requests.get("https://4anime.to/?s=one+piece", headers=head)
soupObj = bs4.BeautifulSoup(searchResObj.text, features="html.parser")
I tried all kinds of different approaches to parsing the links. This is just a simplified version that fetches all the links on the page, so I am not missing any.
all_a = soupObj.select("a")
for links in all_a:
    print(links.get("href"))
I also viewed the HTML code my script received; the hyperlinks are indeed different from the ones showing in my browser:
print(searchResObj.text)
So what might be causing this?
Running this script prints 14 links, which show in the browser too (maybe you've got a captcha page?):
import requests
from bs4 import BeautifulSoup
searchResObj = requests.get("https://4anime.to/?s=one+piece")
soupObj = BeautifulSoup(searchResObj.text, features="html.parser")
for a in soupObj.select('#headerDIV_95 > a'):
    print(a['href'])
Prints:
https://4anime.to/anime/one-piece-nenmatsu-tokubetsu-kikaku-mugiwara-no-luffy-oyabun-torimonochou
https://4anime.to/anime/one-piece-straw-hat-theater
https://4anime.to/anime/one-piece-movie-14-stampede
https://4anime.to/anime/one-piece-yume-no-soccer-ou
https://4anime.to/anime/one-piece-mezase-kaizoku-yakyuu-ou
https://4anime.to/anime/one-piece-umi-no-heso-no-daibouken-hen
https://4anime.to/anime/one-piece-film-gold
https://4anime.to/anime/one-piece-heart-of-gold
https://4anime.to/anime/one-piece-episode-of-sorajima
https://4anime.to/anime/one-piece-episode-of-sabo
https://4anime.to/anime/one-piece-episode-of-nami
https://4anime.to/anime/one-piece-episode-of-merry
https://4anime.to/anime/one-piece-episode-of-luffy
https://4anime.to/anime/one-piece-episode-of-east-blue
EDIT: Screenshot from "View Source Code":
I'm new to Python (actually, it's the second time I'm trying to learn the language, so I know a little something) and I'm trying to build a script that scrapes the weather forecast.
Now I have a little problem with finding the right HTML classes to pull into Python. I have this code now:
import requests
from bs4 import BeautifulSoup
page = requests.get("https://openweathermap.org/city/2743477")
soup = BeautifulSoup(page.content, 'html.parser')
city_name = soup.find(class_="weather-widget__city-name")
print(city_name)
The problem is that this just returns 'None'.
I found the class that the code searches for via Chrome's inspect tool. If I export the HTML page through Python with the following code:
import requests
from bs4 import BeautifulSoup
page = requests.get("https://openweathermap.org/city/2743477")
soup = BeautifulSoup(page.content, 'html.parser')
city_name = soup.find(class_="weather-widget__city-name")
print(soup.prettify())
then I see the HTML page in cmd (as expected), but I'm also unable to find "weather-widget__city-name" in it, so I'm not surprised that Python can't find it either. My question is: why is the HTML code that Python gives me different from the HTML code Chrome shows on the site? And am I doing something wrong in trying to find the weather widget through BeautifulSoup this way?
Here is a picture from the page; the part I'm trying to scrape is circled in red.
Screenshot from website
Thanks in advance!
That site is loaded with JS.
Python requests doesn't execute those scripts. One of those scripts is responsible for loading the data you are looking for (you can tell it's JS, maybe with a bit of jQuery, I didn't actually check, by the spinning circle while the data loads).
My suggestion here is to use the site's API.
I am not subscribed to the site so I can't show a real example here, but the trick is simple: you sign up for the site's API on the basic (and free) plan, get an API key, and start sending GET requests to the API URLs.
This will also simplify things for you further, since you wouldn't need BeautifulSoup for the parsing; the responses are all JSON.
There is another, nastier way around it, and that is using selenium. This module will drive a real web browser, with all of its JS execution, HTML rendering, and CSS loading mechanisms.
I have experience with both, and I strongly recommend sticking to the API (if that option exists).
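For reference, a sketch of the API route once you have a key (YOUR_API_KEY is a placeholder; the endpoint is OpenWeatherMap's current-weather API):

import requests

API_KEY = 'YOUR_API_KEY'  # placeholder: the free key you get after signing up
CITY_ID = 2743477         # city ID taken from the question's URL

url = 'https://api.openweathermap.org/data/2.5/weather'
params = {'id': CITY_ID, 'units': 'metric', 'appid': API_KEY}

data = requests.get(url, params=params).json()  # the response is plain JSON
print(data['name'], data['main']['temp'], data['weather'][0]['description'])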
For sites that use JS to send further requests, after we request the initial URL, one method that can work is to study the network tab of Chrome's developer tools (or the equivalent in any other browser).
You'll usually find a big list of URLs being requested by the browser. Most of them are unnecessary for our purposes, and a few of them relate to other sites such as Google or Facebook.
In this particular case, after the initial URL is requested, you'll find a few '.js' files being retrieved and after that, three scripts (forecast, weather, daily) that correspond to the data which finally gets presented by the browser.
From those three, the data you ask for comes from the 'weather' script. If you click on it in the network tab, another side pane will open which will contain header information, preview, etc.
In the Headers tab, you'll find the URL that you need to use, which is:
https://openweathermap.org/data/2.5/weather?id=2743477&units=metric&appid=b1b15e88fa797225412429c1c50c122a1
The b1b15e88fa797225412429c1c50c122a1 might be a general API key that is assigned to a browser request. I don't know for sure. But all we need to know is that it doesn't change. I've tried on two different systems and this value doesn't change.
The 2743477 is, of course, the City ID. You can download a reference of various cities and their IDs in their site itself:
http://bulk.openweathermap.org/sample/
As nutmeg64 said, the site actually responds with a JSON file. That's the case with both the API and the request of this URL found in the network tab of a browser.
As for the codes appearing in the JSON, the site gives you a reference to the codes and their meanings:
https://openweathermap.org/weather-conditions
With this information, you can use requests and json to retrieve and manipulate the data. Here's a sample script:
from pprint import pprint
import json
import requests
city_id = 2743477
url = 'https://openweathermap.org/data/2.5/weather?id={}&units=metric&appid=b1b15e88fa797225412429c1c50c122a1'.format(city_id)
req_headers = {
'Accept': '*/*',
'Accept-Encoding': 'gzip, deflate, br',
'Accept-Language': 'en-US,en;q=0.8',
'Connection': 'keep-alive',
'Host': 'openweathermap.org',
'Referer': 'https://openweathermap.org/city/2743477',
'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/61.0.3163.100 Safari/537.36'
}
s = requests.Session()
r = s.get(url, headers=req_headers)
d = json.loads(r.text)
pprint(d)
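The interesting values are then plain dictionary lookups; these field names come from OpenWeatherMap's documented response format:

# The same fields the widget displays, straight from the JSON
print(d['wind']['speed'], d['wind']['deg'])          # wind speed (m/s) and direction
print(d['main']['pressure'], d['main']['humidity'])  # pressure (hPa) and humidity (%)
print(d['weather'][0]['description'])                # e.g. 'broken clouds'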
However, as nutmeg64 said, it's better to use the API, and resist the temptation to bombard the site with more requests than you truly need.
You can find all about their API here:
https://openweathermap.org/current
Use selenium in combination with BeautifulSoup to get any of the tables from that page with no hardship. Here is how you can do it:
from selenium import webdriver
from bs4 import BeautifulSoup
driver = webdriver.Chrome()
driver.get("https://openweathermap.org/city/2743477")
soup = BeautifulSoup(driver.page_source, 'lxml')
driver.quit()
table_tag = soup.select(".weather-widget__items")[0]
tab_data = [[item.text.strip() for item in row_data.select("td")]
            for row_data in table_tag.select("tr")]
for data in tab_data:
    print(data)
Partial result:
['Wind', 'Gentle Breeze,\n 3.6 m/s, Southwest ( 220 )']
['Cloudiness', 'Broken clouds']
['Pressure', '1014 hpa']
['Humidity', '100 %']
['Sunrise', '11:53']