Scraping a webpage with Python

I'm trying to learn to scrape a webpage (http://www.expressobeans.com/public/detail.php/185246), but I don't know what I'm doing wrong. I think it has to do with identifying the XPath, but how do I get the correct path (if that is the issue)? I've tried Firebug in Firefox as well as the Developer Tools in Chrome.
I want to be able to scrape the Manufacturer value (D&L Screenprinting) as well as all the Edition Details.
python script:
from lxml import html
import requests
page = requests.get('http://www.expressobeans.com/public/detail.php/185246')
tree = html.fromstring(page.text)
buyers = tree.xpath('//*[@id="content"]/table/tbody/tr[2]/td/table/tbody/tr/td[1]/dl/dd[3]')
print(buyers)
returns:
[]

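Remove tbody from the XPath. Browsers add tbody elements when they build the DOM, so the developer tools can show them even when the HTML that requests downloads has none.
buyers = tree.xpath('//*[@id="content"]/table/tr[2]/td/table/tr/td[1]/dl/dd[3]')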
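To see the effect of tbody in isolation, here is a minimal sketch that runs both styles of path against a small, made-up HTML string (an assumption for illustration, not the real page markup). lxml parses the markup as written and does not add tbody, so only the path without it matches.

```python
from lxml import html

# Made-up markup for illustration; the real page is larger and may differ.
RAW = """
<html><body>
<div id="content">
  <table>
    <tr><td>header row</td></tr>
    <tr><td><dl><dd><a href="#">D&amp;L Screenprinting</a></dd></dl></td></tr>
  </table>
</div>
</body></html>
"""

tree = html.fromstring(RAW)

# Path copied from a browser inspector, which shows an inserted <tbody>.
with_tbody = tree.xpath('//*[@id="content"]/table/tbody/tr[2]/td/dl/dd/a/text()')

# Same path without <tbody>, matching the markup as it was written.
without_tbody = tree.xpath('//*[@id="content"]/table/tr[2]/td/dl/dd/a/text()')

print(with_tbody)
print(without_tbody)
```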

I'd start by suggesting you look at the page HTML and try to find a node closer to the value you are looking for, and build your path from there to make it shorter and easier to follow.
In that page I can see that there is a "dl" with class "itemListingInfo" and under that one all the information you are looking for.
Also, if you want the "D&L Screenprinting" text, you need to extract the text from the link.
Try with this modified version, it should be straightforward to add the other xpath expressions and get the other fields as well.
from lxml import html
import requests
page = requests.get('http://www.expressobeans.com/public/detail.php/185246')
tree = html.fromstring(page.text)
buyers = tree.xpath('//dl[@class="itemListingInfo"]/dd[2]/a/text()')
print(buyers)
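If you want all the fields at once rather than one XPath per field, one option is to pair each dt label with the dd that follows it. This is only a sketch: the sample markup and its field values are made up, and it assumes the real list alternates dt and dd elements.

```python
from lxml import html

# Made-up sample that mimics a definition list; the real page may differ.
SAMPLE = """
<html><body>
<dl class="itemListingInfo">
  <dt>Artist</dt><dd><a href="#">Some Artist</a></dd>
  <dt>Manufacturer</dt><dd><a href="#">D&amp;L Screenprinting</a></dd>
  <dt>Edition Details</dt><dd>Signed and numbered</dd>
</dl>
</body></html>
"""

tree = html.fromstring(SAMPLE)

info = {}
for dt in tree.xpath('//dl[@class="itemListingInfo"]/dt'):
    dd = dt.getnext()  # the sibling element right after this <dt>
    if dd is not None and dd.tag == "dd":
        # text_content() also collects text nested inside links
        info[dt.text_content().strip()] = dd.text_content().strip()

print(info["Manufacturer"])
```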

Related

Select css tags with randomized letters at the end

I am currently learning web scraping with python. I'm reading Web scraping with Python by Ryan Mitchell.
I am stuck at Crawling Sites Through Search. For example, reuters search given in the book works perfectly but when I try to find it by myself, as I will do in the future, I get this link.
Whilst in the second link it is working for a human, I cannot figure out how to scrape it due to weird class names like this class="media-story-card__body__3tRWy"
The first link gives me simple names, like this class="search-result-content" that I can scrape.
I've encountered the same problem on other sites too. How would I go about scraping it or finding a link with normal names in the future?
Here's my code example:
from bs4 import BeautifulSoup
import requests
from rich.pretty import pprint
text = "hello"
url = f"https://www.reuters.com/site-search/?query={text}"
response = requests.get(url)
soup = BeautifulSoup(response.text, "lxml")
results = soup.select("div.media-story-card__body__3tRWy")
for result in results:
    pprint(result)
    pprint("###############")
You might resort to a prefix attribute value selector, like
div[class^="media-story-card__body__"]
This assumes that the class is the only one (or at least the first one listed in the attribute). The idea can be extended to a substring match with *= instead of ^=.
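A small sketch of both selectors with BeautifulSoup, run against made-up markup (the second suffix and the extra "card" class are assumptions for illustration). The prefix form only matches when the class attribute starts with the stable part; the substring form also matches when another class comes first.

```python
from bs4 import BeautifulSoup

# Made-up markup; only the stable prefix comes from the question.
SAMPLE = """
<div class="media-story-card__body__3tRWy">First</div>
<div class="card media-story-card__body__9xKQp">Second</div>
<div class="search-result-content">Third</div>
"""

soup = BeautifulSoup(SAMPLE, "html.parser")

# ^= : the class attribute must begin with the given text
prefix = [d.get_text() for d in soup.select('div[class^="media-story-card__body__"]')]

# *= : the given text may appear anywhere in the class attribute
substring = [d.get_text() for d in soup.select('div[class*="media-story-card__body__"]')]

print(prefix)
print(substring)
```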

Unable to find tag when data scraping

I am new to Python and I've been working on a program that alerts you when a new item is uploaded to jp.mercari.com (a shopping site). I have the alert part of the program working, but it operates based on the number of items that come up on the search results. When I scrape the website I am unable to find what I am looking for despite being able to locate it when I inspect element on the page. The scraping program looks like this:
from bs4 import BeautifulSoup
import requests
url = "https://jp.mercari.com/search?keyword=pachinko"
result = requests.get(url)
doc = BeautifulSoup(result.text, "html.parser")
tag = doc.find_all("mer-text")
print(tag)
For more context, this is the website and some of the HTML. I've circled the parts I am trying to find in red:
Does anyone know why I am unable to find what I'm looking for?
Here is another example of the same problem but from a website that is in English:
from bs4 import BeautifulSoup
import requests
url = "https://www.vinted.co.uk/vetements?search_text=pachinko"
result = requests.get(url)
doc = BeautifulSoup(result.text, "html.parser")
tag = doc.find_all("span")
print(tag)
Again, I can see the part of HTML I want to find when I inspect element but I can't find it when I scrape the website:
Here's what happens when I run it: the element you seek (<mer-text>) is being found. However, the output is in Japanese, which may not display readably in your console. In my browser, it's being translated to English automatically by Google, so that's easier to deal with.
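One way to check this yourself is to count the matches and print their text instead of printing the raw tag list. The sketch below uses a made-up snippet with Japanese text rather than the real page, so it only shows that the lookup itself is independent of the language; whether the tags are present in what requests downloads is something to verify on the real response, since content added by JavaScript in the browser will not be in result.text.

```python
from bs4 import BeautifulSoup

# Made-up snippet; the real page markup may differ.
SAMPLE = """
<html><body>
<mer-text>パチンコ 実機</mer-text>
<mer-text>新品</mer-text>
</body></html>
"""

doc = BeautifulSoup(SAMPLE, "html.parser")
tags = doc.find_all("mer-text")

# Pull out just the text of each match
texts = [t.get_text(strip=True) for t in tags]

print(len(tags), texts)
```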

Extract tables in webpages from Python/R or other software

I would like to extract Name, Address of School, Tel, Fax ,Head of School from the website:
https://www.edb.gov.hk/en/student-parents/sch-info/sch-search/schlist-by-district/school-list-cw.html
Is it possible to do so?
Yes, it is possible, and there are many tools that can help. If you do not want to use a programming language, there are plenty of tools available, though you will probably have to pay for them; this article might be useful: https://popupsmart.com/blog/web-scraping-tools
However, if you want to use Python, you load the page, parse the HTML, then find the element you want and fetch its data. This article explains the whole process with code: https://www.freecodecamp.org/news/web-scraping-python-tutorial-how-to-scrape-data-from-a-website/
Here is some simple code that prints the tables in the page you posted, based on the code from the article above:
import requests
from bs4 import BeautifulSoup
# Make a request
page = requests.get(
    "https://www.edb.gov.hk/en/student-parents/sch-info/sch-search/schlist-by-district/school-list-cw.html")
soup = BeautifulSoup(page.content, 'html.parser')
# Select every <table> element on the page and print it
tables = soup.select('table')
for table in tables:
    print(table)
You can try it out here:
https://colab.research.google.com/drive/13EzFWNBqpkGf4CZvGt5pYySCuW7Ij6I4?usp=sharing
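To go from printed tables to usable fields, you can walk the rows and cells. Here is a minimal sketch on a made-up table; the column names and values are placeholders, and the real page's table layout is an assumption you would need to check.

```python
from bs4 import BeautifulSoup

# Made-up table; the real page's columns and layout may differ.
SAMPLE = """
<table>
  <tr><th>Name</th><th>Address</th><th>Tel</th></tr>
  <tr><td>Example School A</td><td>1 Example Road</td><td>0000 0000</td></tr>
  <tr><td>Example School B</td><td>2 Example Road</td><td>1111 1111</td></tr>
</table>
"""

soup = BeautifulSoup(SAMPLE, "html.parser")

rows = []
for tr in soup.select("table tr"):
    # Collect the text of every header or data cell in this row
    cells = [c.get_text(strip=True) for c in tr.find_all(["th", "td"])]
    if cells:
        rows.append(cells)

# First row is treated as the header; the rest become dictionaries
header = rows[0]
records = [dict(zip(header, r)) for r in rows[1:]]

for record in records:
    print(record)
```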

Scraping all the names from a website

I am currently trying to scrape all of the names from a specific website. I was making some progress by following a guide on python-guide.org. I was able to scrape a lot of the information off of a certain site, but not the information I was after. Here is my code so far:
from lxml import html
import requests
page = requests.get('http://www.behindthename.com/names/gender/feminine/usage/african')
tree = html.fromstring(page.content)
# This will create a list of names:
Names = tree.xpath('//div[@class="browsename"]/text()')
print('Names: ', Names)
Unfortunately, that returns a lot of information, but not the list of names. I'm not sure what I'm doing wrong, but I am certain it has to do with the @class="browsename". I'm not very familiar with HTML.
Maybe you should use:
//div[@class="browsename"]/b/a/text()
In Chrome, you can press F12 to inspect elements, then press Ctrl+F in the Elements panel and type your XPath. Chrome will highlight the elements it matches.
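The difference between the two paths can be seen on a small sample. This sketch assumes, as the suggested XPath implies, that each name sits in a link inside a b element; the markup below is made up and the real page may differ.

```python
from lxml import html

# Made-up markup following the structure the XPath implies.
SAMPLE = """
<html><body>
<div class="browsename"><b><a href="/name/abena">Abena</a></b> f Akan</div>
<div class="browsename"><b><a href="/name/adaeze">Adaeze</a></b> f Igbo</div>
</body></html>
"""

tree = html.fromstring(SAMPLE)

# text() directly under the div only sees text outside the nested tags
direct_text = tree.xpath('//div[@class="browsename"]/text()')

# Descending into b/a reaches the name itself
names = tree.xpath('//div[@class="browsename"]/b/a/text()')

print(direct_text)
print(names)
```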

How do I make the code return the text using xpath?

from lxml import html
import requests
import time
#Gets prices
page = requests.get('https://www.amazon.com/s/ref=nb_sb_noss?url=search-alias%3Daps&field-keywords=hi')
tree = html.fromstring(page.content)
price = tree.xpath('//h2[@data-attribute="Hi Guess the Food - What’s the Food Brand in the Picture"]/text()')
print(price)
This only returns []
When looking into page.content, it shows the amazon anti bot stuff. How can I bypass this?
One piece of general advice when you're trying to scrape a website: look at the returned content first, in this case page.content, before trying anything else. You're wrongly assuming that Amazon lets you fetch their data freely, when they don't.
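A rough way to automate that first look is sketched below. The helper name looks_blocked and the marker strings are hypothetical, chosen for illustration; the actual wording of a site's anti-bot page may be different, so treat this as a starting point rather than a reliable detector.

```python
def looks_blocked(status_code, body, markers=("captcha", "robot check", "automated access")):
    """Return True when a response looks like an anti-bot page rather than real content.

    The marker strings are guesses for illustration, not a list any site documents.
    """
    if status_code != 200:
        return True
    lowered = body.lower()
    return any(marker in lowered for marker in markers)


# Made-up response bodies standing in for page.text
blocked = looks_blocked(200, "<html><title>Robot Check</title>Enter the characters (CAPTCHA)</html>")
normal = looks_blocked(200, "<html><h2>Some product title</h2></html>")

print(blocked, normal)
```

With requests, this could be called as looks_blocked(page.status_code, page.text) before running any XPath.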
I think urllib2 is better, and the XPath could be:
price = tree.xpath('//div[@class="s-item-container"]//h2')[0]
print(price.text)
After all, a long string may contain strange characters.
