Beautiful Soup Classic Confusion - python

Working with Python and Beautiful Soup. I'm a bit new to CSS markup, so I know I'm making some beginner mistakes; a specific example would go a long way in helping me understand.
I'm trying to scrape a page for links, but only certain links.
CSS
links = soup.find_all("a", class_="details-title")

The code you have will search for links with the details-title class, which don't exist in the sample you provided. It seems like you are trying to find links located inside divs with the details-title class. I believe the easiest way to do this is to search using CSS selectors, which you can do with Beautiful Soup's .select method.
Example: links = soup.select("div.details-title a")
The <tag>.<class> syntax matches every tag of that type with that class, and a space between two selectors restricts the second one to descendants of the elements matched by the first. The Beautiful Soup documentation on CSS selectors has more information.
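To make the difference concrete, here is a minimal sketch. The HTML is a made-up stand-in for the page in the question (the original sample is not shown here), assuming the class sits on the surrounding div rather than on the link itself.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: the class is on the div, not on the <a> tag.
html = """
<div class="details-title"><a href="/item/1">First item</a></div>
<div class="details-title"><a href="/item/2">Second item</a></div>
<div class="other"><a href="/about">About</a></div>
"""

soup = BeautifulSoup(html, "html.parser")

# No <a> tag carries the class itself, so this matches nothing.
by_class = soup.find_all("a", class_="details-title")

# Descendant selector: every <a> located inside a div.details-title.
links = soup.select("div.details-title a")
hrefs = [a["href"] for a in links]

print(len(by_class))
print(hrefs)
```

If the links really are nested like this, the select call picks up only the two links inside the details-title divs and skips the one in the other div.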

Related

How to extract text from ng-href with Scrapy

There is a real estate website with infinite scrolling, and I have tried to extract the companies' names and other details, but I have a problem writing the selectors. I need some insights as a new learner in Scrapy.
HTML Snippet:
First, handle the "more" button if the website has one.
After that, you can copy the selector for an element from the developer tools in most browsers.
Depending on the function you are using, copy the XPath or another selector type for the scraping process.
If that does not help, please give the link to the webpage and say which values you want to scrape.
As I understand it, you want to get the href from the tag and you don't know how to do it in Scrapy.
You just need to add ::attr(ng-href) to the end of your CSS selector.
link = response.css('your_selector::attr(ng-href)').get()
To make it easier for you, your CSS selector should be
link = response.css('.companyNameSpecs a::attr(ng-href)').get()
It looks like href and ng-href hold the same value, so you can do the same with href:
link = response.css('your_selector::attr(href)').get()

HTML Selector using python’s bs4

I’m fairly new at this, and I’m trying to work through Automate the Boring Stuff and make some of my own programs along the way. I’m trying to use Beautiful Soup’s ‘select’ method to pull the value ‘33’ out of this code:
<span class="wu-value wu-value-to" _ngcontent-c19="">33</span>
I know that the span element is inside a div, and I’ve tried a few selectors, including:
high_temp = w_u_soup.select('div > span .wu-value wu-value-to')
But I haven’t been able to get 33 out. Any help would be appreciated. I’ve tried to look up what _ngcontent-c19 is, but I’m having trouble understanding what I’ve found so far (I’m trying to learn Python, and it seems I’ll be learning a bit of HTML as a consequence).
I think you have a couple of different issues here.
First, your selector is wrong: as written, it tries to select an element called wu-value-to (which isn't a valid HTML element) inside something with class wu-value inside a span that is a direct child of a div. To select an element with particular classes, put no space between the element name and the class selectors.
So your selector should probably be div > span.wu-value.wu-value-to. If your entire HTML is the part you showed, just 'span' would be enough, but I'm guessing you are being specific by specifying the parent and those classes for a reason.
Second, you are selecting the element, not its text content. You need your_node.text to get the text content. Note that select returns a list of matches, so pick a single element first, for example with select_one.
Putting it together, you should be able to get what you want with this:
w_u_soup.select_one('div > span.wu-value.wu-value-to').text
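Here is a self-contained sketch of the same idea. The span is copied from the question; the wrapping div is an assumption based on the statement that the span sits inside a div.

```python
from bs4 import BeautifulSoup

# The span is from the question; the surrounding <div> is assumed.
html = '<div><span class="wu-value wu-value-to" _ngcontent-c19="">33</span></div>'
w_u_soup = BeautifulSoup(html, "html.parser")

selector = "div > span.wu-value.wu-value-to"

# select() returns a list of matching tags, so index into it before using .text.
matches = w_u_soup.select(selector)
high_temp = matches[0].text if matches else None

# select_one() returns the first match directly, or None if nothing matches.
first = w_u_soup.select_one(selector)
high_temp_again = first.text if first is not None else None

print(high_temp, high_temp_again)
```

Checking for an empty result first avoids an IndexError or AttributeError when the page layout changes and the selector stops matching.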

Crawling text of a specific heading for any web page URL document in python

I have searched and been introduced to some of the web crawling libraries in Python, like Scrapy and Beautiful Soup. Using these libraries, I want to extract all of the text under a specific heading in a document. Any help would be highly appreciated. I have seen tutorials showing how to get links under a specific class name (found via the view page source option) using Beautiful Soup, but how can I get plain text, not links, under a specific heading class? Sorry for my bad English.
import requests
from bs4 import BeautifulSoup
r = requests.get('https://patents.google.com/patent/US6886010B2/en')
print(r.content)
# Name the parser explicitly so the result does not depend on which parsers are installed
soup = BeautifulSoup(r.content, 'html.parser')
for link in soup.find_all("div", class_="claims"):
    print(link)
Here I have extracted the claims text, but the output also includes the other divs nested inside the claims (divs within divs). I just want to extract the text of the claims only.
By links, I assume you mean the entire contents of the div elements. If you'd like to just print the text contained within them, use the .text attribute or .get_text() method. The entire text of the claims is wrapped inside a unique section element. So you might want to try this:
print(soup.find('section', attrs={'id': 'claims'}).text)
The get_text method gives you a bit more flexibility such as joining bits of text together with a separator and stripping the text of extra newlines.
Also, take a look at the BeautifulSoup Documentation and spend some time reading it.
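As a rough illustration of the get_text options mentioned above, the sketch below runs on a small made-up fragment rather than the live patent page. The section id matches the answer; the nested div class names and claim wording are invented.

```python
from bs4 import BeautifulSoup

# Hypothetical fragment: nested divs inside a section with id="claims".
html = """
<section id="claims">
  <div class="claim"><div class="claim-text">1. A method comprising a first step.</div></div>
  <div class="claim"><div class="claim-text">2. The method of claim 1, further comprising a second step.</div></div>
</section>
"""

soup = BeautifulSoup(html, "html.parser")
section = soup.find("section", attrs={"id": "claims"})

# separator joins the individual text pieces; strip=True trims each piece
# and drops the whitespace-only ones left over from the markup's indentation.
claims_text = section.get_text(separator="\n", strip=True) if section else ""

print(claims_text)
```

Only the text comes back, one claim per line, with none of the nested div markup.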

scraping css values using scrapy framework

Is there a way to scrape CSS values while scraping with the Python Scrapy framework, or with PHP scraping?
Any help will be appreciated.
scrapy.Selector allows you to use xpath to extract properties of HTML elements including CSS.
e.g. https://github.com/okfde/odm-datenerfassung/blob/master/crawl/dirbot/spiders/data.py#L83
(look around that code for how it fits into an entire scrapy spider)
If you don't need web crawling and just html parsing, you can use xpath directly from lxml in python. Another example:
https://github.com/codeformunich/feinstaubbot/blob/master/feinstaubbot.py
Finally, to get at the CSS from XPath, I only know how to do it via css=element.attrib['style']. This gives you everything inside the style attribute, which you can then split into declarations with css.split(';') and split each of those on ':'.
It wouldn't surprise me if someone has a better suggestion. A little knowledge is enough to do a lot of scraping and that's how I would approach it based on previous projects.
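The splitting of the style attribute described above could look roughly like this. It is plain string handling, not a Scrapy or lxml API; parse_style is a hypothetical helper name and the style string is made up.

```python
def parse_style(style):
    """Split an inline style string into a {property: value} dict."""
    declarations = {}
    for declaration in style.split(";"):
        # Skip empty pieces, such as the one after a trailing semicolon.
        if ":" not in declaration:
            continue
        # Split on the first colon only, so values like url(http://...) stay intact.
        name, value = declaration.split(":", 1)
        declarations[name.strip().lower()] = value.strip()
    return declarations


# Example input, as it might come back from element.attrib['style'].
css = "color: red; background: url(http://example.com/bg.png); font-size:12px;"
styles = parse_style(css)

print(styles["color"])
print(styles["font-size"])
```

This only sees inline styles and will break on values that themselves contain a semicolon. Rules that come from stylesheets are not in the style attribute at all.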
Yes. Please check the documentation for selectors. Basically, you have two methods: response.xpath() for XPath and response.css() for CSS selectors. For example, to get a title's text you could do either of the following:
response.xpath('//title/text()').extract_first()
response.css('title::text').extract_first()

Parsing multiple News articles

I have built a summarization program that uses a parser to parse multiple websites at a time. I extract only the <p> elements in each article.
This returns a lot of random content that is unrelated to the article. I've seen several people who can parse any article perfectly. How can I do it? I am using Beautiful Soup.
It might be worth trying an existing package like python-goose, which does what it sounds like you're asking for: extracting article content from web pages.
Your solution is really going to be specific to each website you want to scrape. Without knowing the websites of interest, the only thing I can suggest is to inspect the page source of each page and check whether the article is contained in some HTML element with a specific attribute (a unique class, an id, or even a summary attribute). Then use Beautiful Soup to get the inner text from that element.
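As a sketch of that second suggestion, the example below limits the paragraph search to one container element. The markup and the article-body class are invented; on a real site you would substitute whatever class or id the page source shows.

```python
from bs4 import BeautifulSoup

# Hypothetical page: navigation and footer also contain <p> tags.
html = """
<nav><p>Home | World | Sport</p></nav>
<div class="article-body">
  <p>First paragraph of the story.</p>
  <p>Second paragraph of the story.</p>
</div>
<footer><p>Subscribe to our newsletter.</p></footer>
"""

soup = BeautifulSoup(html, "html.parser")

# Searching the whole document picks up the unrelated paragraphs too.
all_paragraphs = [p.get_text(strip=True) for p in soup.find_all("p")]

# Searching only inside the article container keeps just the story text.
container = soup.find("div", class_="article-body")
article_paragraphs = (
    [p.get_text(strip=True) for p in container.find_all("p")] if container else []
)

print(len(all_paragraphs))
print(article_paragraphs)
```

Because the container differs from site to site, one approach is to keep a small per-site mapping of selectors and fall back to a general extractor when a site is not in the mapping.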
