Python HTML scraping

It's not really scraping, I'm just trying to find the URLs in a web page where the class has a specific value. For example:
<a class="myClass" href="/url/7df028f508c4685ddf65987a0bd6f22e">
I want to get the href value. Any ideas on how to do this? Maybe regex? Could you post some example code?
I'm guessing html scraping libs, such as BeautifulSoup, are a bit of overkill just for this...
Huge thanks!

Regex is usually a bad idea; try using BeautifulSoup.
Quick example:
html = ...  # get html
soup = BeautifulSoup(html)
links = soup.findAll('a', attrs={'class': 'myClass'})
for link in links:
    # process link
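If pulling in BeautifulSoup really does feel like overkill, the standard library's HTMLParser can handle this one narrow job too. A minimal Python 3 sketch (the class name and URL are taken from the question; `ClassHrefCollector` is a made-up name):

```python
from html.parser import HTMLParser

class ClassHrefCollector(HTMLParser):
    """Collect href values from <a> tags whose class attribute is 'myClass'."""
    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "a" and attrs.get("class") == "myClass":
            self.hrefs.append(attrs.get("href"))

parser = ClassHrefCollector()
parser.feed('<a class="myClass" href="/url/7df028f508c4685ddf65987a0bd6f22e">link</a>')
print(parser.hrefs)  # -> ['/url/7df028f508c4685ddf65987a0bd6f22e']
```

Note this compares the whole class attribute; an element with class="myClass other" would need `"myClass" in attrs.get("class", "").split()` instead.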

Aargh, not regex for parsing HTML!
Luckily in Python we have BeautifulSoup or lxml to do that job for us.

Regex would be a bad choice. HTML is not a regular language. How about Beautiful Soup?

Regex should not be used to parse HTML. See the first answer to this question for an explanation :)
+1 for BeautifulSoup.

If your task is just this simple, just use string manipulation (without even regex):
f = open("htmlfile")
for line in f:
    if "<a class" in line and "myClass" in line and "href" in line:
        s = line[line.index("href") + len('href="'):]
        print s[:s.index('">')]
f.close()
An HTML parser is not a must for such simple cases.
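The slice-based idea can be packaged as a small Python 3 function that's easy to test (the `extract_href` name is made up; the assumption that the attribute appears as `href="..."` is inherited from the snippet above):

```python
def extract_href(line):
    """Return the href value from a line containing a myClass anchor, or None."""
    if "<a class" in line and "myClass" in line and "href" in line:
        # slice out everything between href=" and the next quote
        start = line.index('href="') + len('href="')
        return line[start:line.index('"', start)]
    return None

print(extract_href('<a class="myClass" href="/url/7df028f508c4685ddf65987a0bd6f22e">'))
```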

The thing is, I know the structure of the HTML page, and I just want to find that specific kind of link (where class="myClass"). BeautifulSoup anyway?

Read "Parsing Html The Cthulhu Way": https://blog.codinghorror.com/parsing-html-the-cthulhu-way/

Related

Use python to find html or js tags. (regex?)

I am also open to solutions other than regex.
Would checking for angle brackets be enough?
Any suggestions? Thanks!
Edit: what I need is NOT to parse the HTML tags, but just to check whether the string contains any tags or not
You can use the BeautifulSoup parser and check whether there are any tags by iterating over the BeautifulSoup object and checking if there is at least one Tag element:
from bs4 import BeautifulSoup, Tag
l = ['test', 'test <br>', '<br>']
for item in l:
    soup = BeautifulSoup(item, 'html.parser')
    print item, any(isinstance(element, Tag) for element in soup)
prints:
test False
test <br> True
<br> True
Hope that helps.
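The same check also works without any third-party install, using the standard library's HTMLParser. A Python 3 sketch (`TagDetector` and `has_tags` are made-up names):

```python
from html.parser import HTMLParser

class TagDetector(HTMLParser):
    """Set a flag as soon as any start tag (including void tags like <br>) is seen."""
    def __init__(self):
        super().__init__()
        self.saw_tag = False

    def handle_starttag(self, tag, attrs):
        self.saw_tag = True

def has_tags(text):
    detector = TagDetector()
    detector.feed(text)
    return detector.saw_tag

for item in ['test', 'test <br>', '<br>']:
    print(item, has_tags(item))
```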
I highly recommend lxml.html for anything related to parsing (XML, HTML, XHTML, ...).
To get the whole idea, just have a quick look at these graphs and you will know what I am talking about ;)
For a more detailed comparison, have a look here.

How to copy all the text from url (like [Ctrl+A][Ctrl+C] with webbrowser) in python?

I know there is an easy way to copy all the source of a URL, but that's not my task. I need to save just the text (exactly as a web browser user would copy it) to a *.txt file.
Is parsing the HTML source unavoidable, or is there a better way?
I think it is impossible without parsing at all. You could use HTMLParser (http://docs.python.org/2/library/htmlparser.html) and just keep the data between tags, but you will most likely get many elements you don't want.
To get exactly the same as [Ctrl-C] would be very difficult, because of things like style="display: hidden;" which hide text; handling that correctly requires full parsing of the HTML, JavaScript and CSS of both the document and its resource files.
Parsing is required. I don't know if there's a library method. A simple regex:
import re
text = re.sub(r"<[^>]+>", " ", html)
This requires many improvements, but it's a starting point.
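One of those improvements is skipping invisible script and style content, which the standard library's HTMLParser can track. A Python 3 sketch (`TextExtractor` and `html_to_text` are made-up names):

```python
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Accumulate text data, skipping anything inside <script> or <style>."""
    SKIP = {"script", "style"}

    def __init__(self):
        super().__init__()
        self.parts = []
        self.skipping = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self.skipping += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self.skipping:
            self.skipping -= 1

    def handle_data(self, data):
        if not self.skipping:
            self.parts.append(data)

def html_to_text(html):
    parser = TextExtractor()
    parser.feed(html)
    return "".join(parser.parts)

print(html_to_text('<p>Hello <b>world</b></p><script>var x=1;</script>'))  # -> Hello world
```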
With Python, the BeautifulSoup module is great for parsing HTML and well worth a look. Getting the text from a webpage is just a case of:
#!/usr/bin/env python
import urllib2
from bs4 import BeautifulSoup

url = 'http://python.org'
html = urllib2.urlopen(url).read()
soup = BeautifulSoup(html)
# you can refine this even further if needed... ie. soup.body.div.get_text()
text = soup.body.get_text()
print text

Beautifulsoup, Python and HTML automatic page truncating?

I'm using Python and BeautifulSoup to parse HTML pages. Unfortunately, for some pages (> 400K) BeatifulSoup is truncating the HTML content.
I use the following code to get the set of "div"s:
findSet = SoupStrainer('div')
set = BeautifulSoup(htmlSource, parseOnlyThese=findSet)
for it in set:
    print it
At a certain point, the output looks like:
correct string, correct string, incomplete/truncated string ("So, I")
although, the htmlSource contains the string "So, I am bored", and many others. Also, I would like to mention that when I prettify() the tree I see the HTML source truncated.
Do you have an idea how can I fix this issue?
Thanks!
Try using lxml.html. It is a faster, better HTML parser, and deals better with broken HTML than the latest BeautifulSoup. It works fine for your example page, parsing the entire page.
import lxml.html
doc = lxml.html.parse('http://voinici.ceata.org/~sana/test.html')
print len(doc.findall('//div'))
The code above returns 131 divs.
I found a solution to this problem using BeautifulSoup at beautifulsoup-where-are-you-putting-my-html, because I think it is easier than lxml.
The only thing you need to do is to install:
pip install html5lib
and add it as a parameter to BeautifulSoup:
soup = BeautifulSoup(html, 'html5lib')

python url fetch help - regex

I have a web site where there are links like <a href="http://www.example.com?read.php=123"> Can anybody show me how to get all the numbers (123, in this case) in such links using python? I don't know how to construct a regex. Thanks in advance.
import re
re.findall(r"\?read\.php=(\d+)", data)
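For instance, run against a made-up data string containing two such links:

```python
import re

data = ('<a href="http://www.example.com?read.php=123">one</a> '
        '<a href="http://www.example.com?read.php=456">two</a>')
# capture the digits following ?read.php= in each link
print(re.findall(r"\?read\.php=(\d+)", data))  # -> ['123', '456']
```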
"If you have a problem, and decide to use regex, now you have two problems..."
If you are reading one particular web page and you know how it is formatted, then regex is fine - you can use S. Mark's answer. To parse a particular link, you can use Kimvai's answer. However, to get all the links from a page, you're better off using something more serious. Any regex solution you come up with will have flaws.
I recommend mechanize. If you notice, the Browser class there has a links method which gets you all the links in a page. It has the added benefit of being able to download the page for you =) .
This will work irrespective of how your links are formatted (e.g. if some look like <a href="foo=123"/> and some look like <A TARGET="_blank" HREF='foo=123'/>).
import re
from BeautifulSoup import BeautifulSoup

soup = BeautifulSoup(html)
p = re.compile(r'^.*=([\d]*)$')
for a in soup.findAll('a'):
    m = p.match(a["href"])
    if m:
        print m.groups()[0]
While the other answers are sort of correct, you should probably use the standard urlparse module instead;
import urlparse
import re

urlre = re.compile('<a[^>]+href="([^"]+)"[^>]*>', re.IGNORECASE)
links = urlre.findall('<a href="http://www.example.com?read.php=123">')
for link in links:
    url = urlparse.urlparse(link)
    s = [x.split("=") for x in url[4].split(';')]
    d = {}
    for k, v in s:
        d[k] = v
    print d["read.php"]
It's not as simple as some of the above, but guaranteed to work even with more complex urls.
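On Python 3 the same query-string handling lives in urllib.parse, which replaces the manual split/dict loop. A minimal sketch using the example URL:

```python
from urllib.parse import urlparse, parse_qs

url = "http://www.example.com?read.php=123"
# parse_qs turns "read.php=123" into {'read.php': ['123']}
query = parse_qs(urlparse(url).query)
print(query["read.php"][0])  # -> 123
```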
/[0-9]+/
That's the regex syntax you want.
For reference, see
http://gnosis.cx/publish/programming/regular_expressions.html
One without the need for regex:
>>> s = '<a href="http://www.example.com?read.php=123">'
>>> for item in s.split(">"):
...     if "href" in item:
...         print item[item.index("a href") + len("a href="):]
...
"http://www.example.com?read.php=123"
If you want to extract the numbers:
item[item.index("a href") + len("a href="):].split("=")[-1]

Decomposing HTML to link text and target

Given an HTML link like
<a href="url">text</a>
how can I isolate the url and the text?
Updates
I'm using Beautiful Soup, and am unable to figure out how to do that.
I did
soup = BeautifulSoup.BeautifulSoup(urllib.urlopen(url))
links = soup.findAll('a')
for link in links:
    print "link content:", link.content, " and attr:", link.attrs
i get
*link content: None and attr: [(u'href', u'_redirectGeneric.asp?genericURL=/root /support.asp')]* ...
...
Why am i missing the content?
edit: elaborated on 'stuck' as advised :)
Use Beautiful Soup. Doing it yourself is harder than it looks, you'll be better off using a tried and tested module.
EDIT:
I think you want:
soup = BeautifulSoup.BeautifulSoup(urllib.urlopen(url).read())
By the way, it's a bad idea to try opening the URL there, as if it goes wrong it could get ugly.
EDIT 2:
This should show you all the links in a page:
import urlparse, urllib
from BeautifulSoup import BeautifulSoup

url = "http://www.example.com/index.html"
source = urllib.urlopen(url).read()
soup = BeautifulSoup(source)
for item in soup.findAll('a'):
    try:
        link = urlparse.urlparse(item['href'].lower())
    except KeyError:
        # Not a valid link
        pass
    else:
        print link
Here's a code example, showing how to get the attributes and contents of the links:
soup = BeautifulSoup.BeautifulSoup(urllib.urlopen(url))
for link in soup.findAll('a'):
    print link.attrs, link.contents
Looks like you have two issues there:
link.contents, not link.content
attrs is not a string. It holds the key-value pairs for each attribute of an HTML element (as a list of tuples in old BeautifulSoup 3, or a dict in bs4). link.attrs['href'] will get you what you appear to be looking for, but you'd want to wrap that in a check in case you come across an a tag without an href attribute.
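Pulling out both pieces at once, with the missing-href check built in, can also be done with the standard library alone. A Python 3 sketch (`LinkCollector` is a made-up name; anchors without an href are simply skipped):

```python
from html.parser import HTMLParser

class LinkCollector(HTMLParser):
    """Collect (href, text) pairs for each <a> element that has an href."""
    def __init__(self):
        super().__init__()
        self.links = []
        self._href = None
        self._text = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append((self._href, "".join(self._text)))
            self._href = None

collector = LinkCollector()
collector.feed('<a href="/docs">read the <b>docs</b></a>')
print(collector.links)  # -> [('/docs', 'read the docs')]
```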
Though I suppose the others might be correct in pointing you to using Beautiful Soup, they might not, and using an external library might be massively over-the-top for your purposes. Here's a regex which will do what you ask.
/<a\s+[^>]*?href="([^"]*)".*?>(.*?)<\/a>/
Here's what it matches:
'<a href="url">text</a>'
// Parts: "url", "text"
'<a href="url">text<span>something</span></a>'
// Parts: "url", "text<span>something</span>"
If you wanted to get just the text (eg: "textsomething" in the second example above), I'd just run another regex over it to strip anything between pointed brackets.
