How to extract article contents from websites with different layouts - Python

I have a list of 1,000 URLs of articles published by different agencies, and of course each has its own HTML layout.
I am writing Python code to extract ONLY the article body from each URL. Can this be done by looking only at the <p></p> paragraph tags?
Will I miss some content, or include irrelevant content, with this approach?
Thanks

For some articles you will miss content, and for others you will include irrelevant content. There is really no reliable way to grab just the article body from an arbitrary URL, since each site's layout will likely vary significantly.
One thing you could try is grabbing text contained in multiple consecutive p tags inside the body tag (see the sketch below), but there is still no guarantee you will get just the body of the article.
It would be a lot easier if you broke the list of URLs into a separate list for each distinct site; that way you could define what the article body is case by case.
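A minimal sketch of that heuristic, assuming requests and BeautifulSoup 4 are installed; the URL is hypothetical, and the "container with the most direct paragraphs wins" rule is an assumption, not a guarantee:

import requests
from bs4 import BeautifulSoup

html = requests.get("https://example.com/article").text  # hypothetical URL
soup = BeautifulSoup(html, "html.parser")

# Assumption: the container with the most direct <p> children holds the article body
candidates = soup.find_all(["article", "div", "section"])
best = max(candidates, key=lambda t: len(t.find_all("p", recursive=False)), default=None)

if best is not None:
    print("\n\n".join(p.get_text(strip=True) for p in best.find_all("p", recursive=False)))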

To answer your question: it's highly unlikely you can get ONLY article content by targeting <p></p> tags. You WILL get a lot of unnecessary content that will take a ton of effort to filter through, guaranteed.
Try to find an RSS feed for these websites. That will make scraping target data much easier than parsing an entire HTML page.
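If a feed exists, a library like feedparser makes this straightforward. A hedged sketch; the feed URL is hypothetical and has to be discovered per site (often /rss or /feed):

import feedparser  # pip install feedparser

feed = feedparser.parse("https://example.com/rss")  # hypothetical feed URL
for entry in feed.entries:
    print(entry.title)
    print(entry.link)
    print(entry.get("summary", ""))  # many feeds carry a summary or full content field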

Related

CSS selector for Instagram post alongside comments not working

In my example code below I have navigated to Obama's first Instagram post. I am trying to point to the portion of the page that is his post and the comments beside it.
driver.get("https://www.instagram.com/p/B-Sj7CggmHt/")
element = driver.find_element_by_css_selector("div._97aPb")
I want this to work for the page of any post by any Instagram user, but it seems that the XPath for the post alongside the comments changes. How can I find the combined post image + comments block regardless of which post it is? Would appreciate any help, thank you.
I would also like to be able to point to the image and to the comments individually. I have gone through multiple user profiles and multiple posts, but both the XPaths and CSS selectors seem to change. I would also appreciate guidance on any reading or resources where I can learn how to properly point to different HTML elements.
You could try selecting based on the top-level structure. Looking more closely, there is always an article tag, and the photo is in the fourth div in, right under the header.
You can do this with BeautifulSoup with something like this:
from bs4 import BeautifulSoup

# parse the page Selenium has already loaded
soup = BeautifulSoup(driver.page_source, 'html.parser')
article = soup.find('article')
divs_in_article = article.find_all('div')
divs_in_article[3] should have the data you are looking for. If BeautifulSoup grabs divs under that first header tag, you may have to get creative and skip that tag first. I would test it myself, but I don't have ChromeDriver running right now.
Alternatively, you could try:
images = soup.find_all('img')
to grab all image tags in the page. This may work too.
BeautifulSoup has a lot of handy methods for selecting elements based on structure. Take a look at going back and forth, going sideways, going down, and going up. You should be able to discern the structure using the developer tools in your browser and then come up with a way to select the collections you care about for the comments.
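For instance, a few of those navigation moves, continuing from the article variable in the snippet above (the header and div tag names are assumptions about Instagram's markup at the time, so verify them in the developer tools first):

header = article.find("header")                # going down: first header inside the article
photo_block = header.find_next_sibling("div")  # going sideways: the div right after the header
first_img = article.find("img")                # going down, at any depth
img_container = first_img.parent               # going up: the image's direct parent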

Scraping webpage with "nonsensical" tags

I am attempting to build a web scraper to aggregate information on state-level House and Senate bills. I am using Python, and I can pull the HTML from the page, but parsing it is giving me difficulty. For example, the New Hampshire bill page wraps information in "nonsensically" named tags. Here is an example page: http://www.gencourt.state.nh.us/bill_status/billText.aspx?sy=2017&id=14&txtFormat=html. How would I go about pulling, for example, the number of the bill from the long list of tags?
If I had to guess, I'd say that markup was generated by some sort of WYSIWYG editor. (The presence of invalid CSS properties like tab-stops suggests that it might be output from a word processor.) If this is the case, the exact usage of classes in the output is unlikely to be consistent between documents.
With this in mind, your best bet will probably be to ignore the markup entirely and parse the text.
Open the page in a browser, right-click on something you want to be able to pull, and use Inspect to see the class name used for that element. For instance, if you inspect the bill number, you'll see that it's
<span class="cs4904F745">76</span>
So in your web scraping code, search for the class cs4904F745 to get the bill number. These things may look random, but I've checked a few documents and they're consistent.
You can use the BeautifulSoup library to parse the HTML and search for what you want.
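Putting that together, a minimal sketch; the class name comes from the example above, and whether it stays stable across documents is an assumption worth re-checking:

import requests
from bs4 import BeautifulSoup

url = "http://www.gencourt.state.nh.us/bill_status/billText.aspx?sy=2017&id=14&txtFormat=html"
soup = BeautifulSoup(requests.get(url).text, "html.parser")

# cs4904F745 is the class observed on the bill-number span in this document
bill_number = soup.find("span", class_="cs4904F745")
print(bill_number.get_text(strip=True) if bill_number else "not found")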

Parsing multiple News articles

I have built a program for summarization that utilizes a parser to parse multiple websites at a time. I extract only the <p> tags in each article.
This pulls in a lot of random content that is unrelated to the article. I've seen several people who can parse any article perfectly. How can I do it? I am using Beautiful Soup.
It might be worth trying an existing package like python-goose, which does what it sounds like you're asking for: extracting article content from web pages.
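A hedged sketch of that usage, shown here with goose3 (the maintained Python 3 fork of python-goose, same interface); the URL is hypothetical:

from goose3 import Goose  # pip install goose3

g = Goose()
article = g.extract(url="https://example.com/article")  # hypothetical URL
print(article.title)
print(article.cleaned_text)  # the extracted article body as plain text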
Your solution is really going to be specific to each site you want to scrape. Without knowing the websites of interest, the only thing I can really suggest is to inspect the page source of each page, check whether the article is contained in some HTML element with a distinctive attribute (a unique class, id, or even a summary attribute), and then use Beautiful Soup to get the inner text from that element.
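For example, a minimal per-site sketch; both the URL and the article-body class are hypothetical placeholders you would replace after inspecting each site:

import requests
from bs4 import BeautifulSoup

html = requests.get("https://example.com/article").text  # hypothetical URL
soup = BeautifulSoup(html, "html.parser")

container = soup.find("div", class_="article-body")  # hypothetical class; adjust per site
if container is not None:
    print(container.get_text(separator="\n", strip=True))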

Parsing a webpage for indexing

I am trying to understand/optimize the logic for indexing a site. I am new to the HTML/JS side of things, so I am learning as I go. While indexing a site, I recursively go deeper into it based on the links on each page. One problem is that pages have repeating URLs and text, like the header and footer. For the URLs, I keep a list of URLs I have already processed. Is there something I can do to identify the text that repeats on each page? I hope my explanation is clear enough. I currently have the code (in Python) to get a list of useful URLs for the site, and now I am trying to index the content of those pages. Is there a preferred logic to identify or skip repeating text on these pages (headers, footers, and other blurbs)? I am using BeautifulSoup and the requests module.
I am not quite sure if this is what you are hoping for, but Readability is a popular service that parses just the "useful" content from a page. It is the service integrated into Safari for iOS.
It intelligently extracts the worthwhile content of the page while ignoring things like footers, headers, and ads.
There are open source ports for python/ruby/php and probably other languages.
https://github.com/buriy/python-readability
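A minimal sketch of the Python port's use (the package installs as readability-lxml; the URL is hypothetical):

import requests
from readability import Document  # pip install readability-lxml

html = requests.get("https://example.com/article").text  # hypothetical URL
doc = Document(html)
print(doc.title())
print(doc.summary())  # cleaned article HTML, with header/footer/ad chrome stripped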

Python Text Extraction from parsed web pages

I'm working to develop a small system for extracting content from web pages (I know it has been done, but it is a good exercise and something I need). Basically, I'm looking to extract the content proper, i.e. if it is an article, I just want the article text and nothing else.
I've just started, so consider me a dumb blank slate. I'm interested in how you do it, and with what, specifically in Python, but I'd be interested in anything.
EDIT:
I've found this rather enlightening and more in tune with what I'm trying to do, so solutions, discussion, and library suggestions along the lines of 'this type of thing' are appreciated.
I have done a little bit of this and I recommend the combination of Mechanize and BeautifulSoup.
I would recommend parsing the HTML tree with beautiful soup and looking for a distinctive tag that identifies the content, perhaps:
<div id="article">
Then you can just take that node from the "soup".
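A hedged sketch combining the two, Mechanize for fetching and BeautifulSoup for parsing; the URL is hypothetical, and the "article" id is site-specific, so check each target page first:

import mechanize  # pip install mechanize
from bs4 import BeautifulSoup

br = mechanize.Browser()
br.set_handle_robots(False)  # some sites reject the default robots handling
html = br.open("https://example.com/article").read()  # hypothetical URL

soup = BeautifulSoup(html, "html.parser")
node = soup.find("div", id="article")
if node is not None:
    print(node.get_text(separator="\n", strip=True))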
