Text Extraction: Used All Methods, Yet Stuck - Python

I want to extract some text from a webpage. I searched Stack Overflow (as well as other sites) to find a proper method. I tried HTML2Text, BeautifulSoup, NLTK and some other manual approaches to do the extraction, and all of them failed. For example:
HTML2Text only works offline (on saved pages), and I need to do it online.
BS4 won't work properly with Unicode (my page is Persian text in UTF-8) and won't extract the text; it also returns HTML tags/code, and I only need the rendered text.
NLTK won't work on my Persian text.
Even opening the page with urllib.request.urlopen raises errors.
So, as you can see, I'm quite stuck after trying several methods.
Here's my target URL: http://vynylyn.yolasite.com/page2.php
I want to extract only the Persian paragraphs, without tags or code.
(Note: I use Eclipse Kepler with Python 3.4. After extracting the text I want to do POS tagging, word/sentence tokenizing, etc. on it.)
What are my options to get this working?

I'd go for your second option first. BeautifulSoup 4 should (and does) definitely support Unicode (note it's UTF-8, a global character encoding, so there's nothing specifically Persian about it).
And yes, you will get tags, as it's an HTML page. Try searching for a unique ID, or look at the HTML structure of the page(s). For your example, look for the main element and then the content elements below it, or maybe use div#I1_sys_txt on that specific page. Once you have your element, you just need to call get_text().
Try this (now in Python 3):
#!/usr/bin/env python3
import requests
from bs4 import BeautifulSoup

# Fetch the page and parse the returned HTML
response = requests.get('http://vynylyn.yolasite.com/page2.php')
soup = BeautifulSoup(response.text)

# The Persian paragraphs on that page live inside this div
tag = soup.find('div', id='I1_sys_txt')
print(tag.get_text() if tag else "<none found>")
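If requests happens to mis-detect the encoding of the Persian text, you can force UTF-8 before reading .text. This is just a small precaution I'd add, not something the original answer covers, sketched below:
response = requests.get('http://vynylyn.yolasite.com/page2.php')
# Assumption: the page really is served as UTF-8, as the question states
response.encoding = 'utf-8'
soup = BeautifulSoup(response.text)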

Related

Crawling text of a specific heading for any web page URL document in python

I have searched and become a little familiar with some of the web crawling libraries in Python, like Scrapy and BeautifulSoup. Using these libraries I want to crawl all of the text under a specific heading in a document. If any of you can help, your help would be highly appreciated. I have seen tutorials on how to get the links under a specific class name (via the view-source option) using Beautiful Soup, but how can I get plain text, not links, under a specific heading class? Sorry for my bad English.
import requests
from bs4 import BeautifulSoup

r = requests.get('https://patents.google.com/patent/US6886010B2/en')
print(r.content)
soup = BeautifulSoup(r.content)
for link in soup.find_all("div", class_="claims"):
    print(link)
Here I have extracted the claims text, but it also shows the other divs nested inside the claims (divs within divs); I just want to extract the text of the claims only.
By links, I assume you mean the entire contents of the div elements. If you'd like to just print the text contained within them, use the .text attribute or .get_text() method. The entire text of the claims is wrapped inside a unique section element. So you might want to try this:
print(soup.find('section', attrs={'id': 'claims'}).text)
The get_text method gives you a bit more flexibility such as joining bits of text together with a separator and stripping the text of extra newlines.
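For example, a minimal sketch of those get_text options (separator and strip are standard BeautifulSoup 4 keyword arguments; the section id is the one used in the answer above):
claims = soup.find('section', attrs={'id': 'claims'})
if claims is not None:
    # Join the text fragments with a space and trim surrounding whitespace
    print(claims.get_text(separator=' ', strip=True))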
Also, take a look at the BeautifulSoup Documentation and spend some time reading it.

Extracting data from webpage using lxml XPath in Python

I am having some trouble using XPath to retrieve text from an HTML page with the lxml library.
The page url is www.mangapanda.com/one-piece/1/1
I want to extract the selected chapter name text from the drop-down select tag. For now I just want the first option, so the XPath to find it is pretty easy. That is:
.//*[@id='chapterMenu']/option[1]/text()
I verified the above using FirePath and it gives the correct data, but when I try to use lxml for the same purpose I get no data at all.
from lxml import html
import requests
r = requests.get("http://www.mangapanda.com/one-piece/1/1")
page = html.fromstring(r.text)
name = page.xpath(".//*[#id='chapterMenu']/option[1]/text()")
But in name nothing is stored. I even tried other XPath's like :-
//div/select[#id='chapterMenu']/option[1]/text()
//select[#id='chapterMenu']/option[1]/text()
The above were also verified using FirePath. I am unable to figure out what could be the problem. I would request some assistance regarding this problem.
But it is not that all aren't working. An xpath that working with lxml xpath here is :-
.//img[#id='img']/#src
Thank you.
I've had a look at the html source of that page and the content of the element with the id chapterMenu is empty.
I think your problem is that the element is filled in using JavaScript, and JavaScript will not be automatically evaluated just by reading the HTML with lxml.html.
You might want to have a look at this:
Evaluate javascript on a local html file (without browser)
Maybe you can trick it, though... In the end, the JavaScript also has to fetch the information using a GET request. In this case it requests: http://www.mangapanda.com/actions/selector/?id=103&which=191919
That response is JSON and can easily be turned into a Python dict/list using the json library.
But you have to find out how to get the id and the which parameter if you want to automate this.
The id is part of the HTML: look for document['mangaid'] within one of the script tags. The which parameter can maybe stay 191919 or may have to be 0; I couldn't find it in any source, but when it is 0 you are redirected to the proper URL.
So there you go ;)
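A minimal sketch of that approach, assuming the endpoint still returns JSON; the id value 103 is only the example from the URL above (in practice you would pull document['mangaid'] out of a script tag first), and which=0 follows the suggestion above:
import json
import requests

# Hypothetical example values taken from the answer above
manga_id = 103
resp = requests.get("http://www.mangapanda.com/actions/selector/",
                    params={"id": manga_id, "which": 0})

# Turn the JSON body into a Python list/dict of chapters
chapters = json.loads(resp.text)
print(chapters)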
The source document of the page you are requesting is in a default namespace:
<html xmlns="http://www.w3.org/1999/xhtml">
even if Firepath does not tell you about this. The proper way to deal with namespaces is to redeclare them in your code, which means associating them with a prefix and then prefixing element names in XPath expressions.
name = page.xpath("//*[@id='chapterMenu']/xhtml:option[1]/text()",
                  namespaces={'xhtml': 'http://www.w3.org/1999/xhtml'})
Then, the piece of the document the path expression above is concerned with is:
<select id="chapterMenu" name="chapterMenu"></select>
As you can see, there is no option element inside it. Please tell us what exactly you'd like to find.

Timetable Web Scraping with multiple tables (Python)

I'm just looking for some info regarding Python web scraping. I'm trying to get all the data from this timetable, and I want each class linked to the time it's on at. Looking at the HTML, there are multiple tables (tables within tables). I'm planning to use Google App Engine with Python (perhaps BeautifulSoup as well). Any suggestions on the best way of going about this?
Thanks
UPDATE:
I've managed to extract the required data from the table using the following code:
import urllib
from lxml import etree
import StringIO

url = ("http://ttcache.dcu.ie/Reporting/Individual;Locations;id;lg25?"
       "template=location+Individual&weeks=20&days=1-5&periods=1-30&Width=0&Height=0")
result = urllib.urlopen(url)
html = result.read()

parser = etree.HTMLParser()
tree = etree.parse(StringIO.StringIO(html), parser)

xpath = "//table[2]/tr/td//text()"
filtered_html = tree.xpath(xpath)
print filtered_html
But I'm getting a lot of these u'\xa0', u'\xa0', '\r\n', '\r\n' characters scattered throughout the parsed text. Any suggestions on how I could combat these?
Thanks
The best library available for parsing HTML is lxml, which is based on libxml2. Although it's intended for XML parsing, it also has an HTML parser that deals with tag soup far better than BeautifulSoup does. Because the parser is written in C, it's also much, much faster.
You'll also get access to XPath for querying the HTML DOM, along with libxml2's support for regular expression matches in XPath expressions, which is very useful for web scraping.
libxml2 and lxml are very well supported and you'll find packages for them on all major distros. Google App Engine appears to support them as well if you're using Python 2.7: https://developers.google.com/appengine/docs/python/tools/libraries27
EDIT:
The characters you're getting appear because there are a lot of empty table cells on the page, so your XPath is often matching whitespace (including non-breaking spaces). You can skip text nodes that contain no non-space characters with a regular expression, something like this:
xpath = "//table[2]/tr/td//text()[re:match(., '\\S')]"
filtered_html = tree.xpath(
xpath,
namespaces={"re": "http://exslt.org/regular-expressions"})
The namespaces bit just tells lxml that you want to use its regular expression extension.
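If a few stray whitespace strings still slip through, a small Python-side cleanup (my own addition, not part of the original answer) also works:
# Strip each text node and drop anything that is empty after stripping
cleaned = [text.strip() for text in filtered_html if text.strip()]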

How to parse a wikipedia page in Python?

I've been trying to parse a wikipedia page in Python and have been quite successful using the API.
But, somehow the API documentation seems a bit too skeletal for me to get all the data.
As of now, I'm doing a requests.get() call to
http://en.wikipedia.org/w/api.php?action=query&prop=extracts&titles=China&format=json&exintro=1
But this only returns the first paragraph, not the entire page. I've tried using allpages and search, but to no avail. A better explanation of how to get the data from a wiki page would be a real help: all the data, and not just the introduction returned by the previous query.
You seem to be using the query action to get the content of the page. According to its API specs, it returns only a part of the data. The proper action seems to be parse.
Here is a sample
import urllib2
req = urllib2.urlopen("http://en.wikipedia.org/w/api.php?action=parse&page=China&format=json&prop=text")
content = req.read()
# content in json - use json or simplejson to get relevant sections.
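A rough sketch of that last step (the parse.text['*'] layout is an assumption about how the MediaWiki API nests the rendered HTML when you ask for format=json and prop=text, so check it against the actual response):
import json

data = json.loads(content)
# Assumption: the rendered article HTML sits under data['parse']['text']['*'];
# inspect the response and adjust the keys if they differ.
article_html = data['parse']['text']['*']
print(article_html[:500])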
Have you considered using Beautiful Soup to extract the content from the page?
While I haven't used it for Wikipedia, others have, and having used it to scrape other pages I can say it is an excellent tool.
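For instance, a minimal sketch of that approach, scraping the rendered article page rather than the API (the choice of <p> tags is just an assumption about which parts of the layout you want):
import requests
from bs4 import BeautifulSoup

resp = requests.get("http://en.wikipedia.org/wiki/China")
soup = BeautifulSoup(resp.text, "html.parser")

# Print the text of every paragraph in the article
for p in soup.find_all("p"):
    print(p.get_text())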
If someone is looking for a Python 3 answer, here you go:
import urllib.request
req = urllib.request.urlopen("http://en.wikipedia.org/w/api.php?action=parse&page=China&format=json&prop=text")
print(req.read())
I'm using Python version 3.7.0b4.

html to text conversion using python language

I'd like to extract the text from an HTML file using Python. I want essentially the same output I would get if I copied the text from a browser and pasted it into notepad.
I'd like something more robust than using regular expressions that may fail on poorly formed HTML. I've seen many people recommend Beautiful Soup, but I've had a few problems using it. For one, it picked up unwanted text, such as JavaScript source. Also, it did not interpret HTML entities. For example, I would expect &#39; in HTML source to be converted to an apostrophe in text, just as if I'd pasted the browser content into notepad.
Update: html2text looks promising. It handles HTML entities correctly and ignores JavaScript. However, it does not exactly produce plain text; it produces markdown that would then have to be turned into plain text. It comes with no examples or documentation, but the code looks clean.
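A minimal sketch of the html2text call I mean (using the module-level html2text.html2text helper; note the result is Markdown, not plain text):
import html2text

# Entities such as &#39; are decoded and scripts are ignored, but the
# output is Markdown rather than plain text.
html = "<p>It&#39;s a <b>test</b>.</p>"
print(html2text.html2text(html))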
You would need to use the urllib2 Python library to get the HTML from the website and then parse through the HTML to grab the text that you want.
Use BeautifulSoup to parse the HTML:
import urllib2
from BeautifulSoup import BeautifulSoup

resp = urllib2.urlopen("http://stackoverflow.com")
rawhtml = resp.read()

# Parse the HTML so you can pull out the text you want
soup = BeautifulSoup(rawhtml)
I don't "copy-paste from browser" is a well-defined operation. For instance, what would happen if the entire page were covered with a transparent floating div? What if it had tables? What about dynamic content?
BeautifulSoup is a powerful parser; you just need to know how to use it (it is easy, for instance, to remove the script tags from the page). Fortunately, it has a lot of documentation.
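For example, a minimal sketch of dropping script tags before taking the text (my own illustration of the point, written against the modern bs4 package):
from bs4 import BeautifulSoup

html = "<html><body><p>Hello</p><script>var x = 1;</script></body></html>"
soup = BeautifulSoup(html, "html.parser")

# Remove every <script> element, then extract the visible text
for script in soup.find_all("script"):
    script.decompose()

print(soup.get_text())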
You can use xml.sax.saxutils.unescape to unescape HTML entities.
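A quick sketch of that call (by default unescape only handles &amp;, &lt; and &gt;; other entities such as &#39; have to be supplied through the optional entities mapping):
from xml.sax.saxutils import unescape

# Extra entities beyond &amp;/&lt;/&gt; go in the second argument
text = unescape("It&#39;s Tom &amp; Jerry", {"&#39;": "'"})
print(text)  # It's Tom & Jerry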
