Decomposing HTML to link text and target - python

Given an HTML link like
texttxt
how can I isolate the url and the text?
Updates
I'm using Beautiful Soup, and am unable to figure out how to do that.
I did
soup = BeautifulSoup.BeautifulSoup(urllib.urlopen(url))
links = soup.findAll('a')
for link in links:
    print "link content:", link.content," and attr:",link.attrs
I get
*link content: None and attr: [(u'href', u'_redirectGeneric.asp?genericURL=/root /support.asp')]* ...
...
Why am I missing the content?
edit: elaborated on 'stuck' as advised :)

Use Beautiful Soup. Doing it yourself is harder than it looks; you'll be better off using a tried and tested module.
EDIT:
I think you want:
soup = BeautifulSoup.BeautifulSoup(urllib.urlopen(url).read())
By the way, it's a bad idea to open the URL right there, because if the request fails you have no clean way to handle the error.
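One way to keep the download separate from the parsing is sketched below. It is written for Python 3 (urllib.request) rather than the Python 2 urllib used above, and the helper name fetch_html and its timeout default are made up for illustration.

```python
import urllib.request

def fetch_html(url, timeout=10):
    """Return the page body as bytes, or None if the request fails."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.read()
    except (OSError, ValueError) as exc:
        # URLError, HTTPError and socket timeouts are OSError subclasses;
        # ValueError is raised for malformed URLs such as a missing scheme.
        print("could not fetch", url, "-", exc)
        return None

# A malformed URL is rejected before any network access happens.
source = fetch_html("not a url")
print(source)
```

The caller can then check for None before handing the result to BeautifulSoup.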
EDIT 2:
This should show you all the links in a page:
import urlparse, urllib
from BeautifulSoup import BeautifulSoup
url = "http://www.example.com/index.html"
source = urllib.urlopen(url).read()
soup = BeautifulSoup(source)
for item in soup.findAll('a'):  # BeautifulSoup 3 has no fetchall(); findAll() is the call
    try:
        link = urlparse.urlparse(item['href'].lower())
    except KeyError:
        # This <a> tag has no href attribute
        pass
    else:
        print link

Here's a code example, showing getting the attributes and contents of the links:
soup = BeautifulSoup.BeautifulSoup(urllib.urlopen(url))
for link in soup.findAll('a'):
    print link.attrs, link.contents

Looks like you have two issues there:
link.contents, not link.content
attrs is not a string; it holds the name/value pairs for each attribute of an HTML element (a list of (name, value) tuples in BeautifulSoup 3, as your output shows, and a dictionary in bs4). link['href'] will get you what you appear to be looking for, but you'd want to wrap that in a check in case you come across an a tag without an href attribute; link.get('href') returns None in that case instead of raising.
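A minimal sketch of reading both the target and the text of each link, assuming the newer bs4 package under Python 3 rather than the BeautifulSoup 3 module from the question. The sample markup and the names sample_html and pairs are invented for the example.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup; the second <a> deliberately has no href.
sample_html = '<p><a href="/support.asp">Support</a> <a name="top">Top</a></p>'

soup = BeautifulSoup(sample_html, "html.parser")

pairs = []
for link in soup.find_all("a"):
    href = link.get("href")   # returns None instead of raising when href is missing
    text = link.get_text()    # all text inside the tag, nested tags included
    if href is not None:
        pairs.append((href, text))

print(pairs)
```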

Though I suppose the others might be correct in pointing you to using Beautiful Soup, they might not, and using an external library might be massively over-the-top for your purposes. Here's a regex which will do what you ask.
/<a\s+[^>]*?href="([^"]*)".*?>(.*?)<\/a>/
Here's what it matches:
'<a href="url">text</a>'
// Parts: "url", "text"
'<a href="url">text<span>something</span></a>'
// Parts: "url", "text<span>something</span>"
If you wanted to get just the text (e.g. "textsomething" in the second example above), I'd just run another regex over it to strip anything between angle brackets.
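For reference, the same pattern can be tried from Python's re module. This is only a sketch: the names LINK_RE, TAG_RE and split_link are made up here, and the pattern keeps the limits of the original (double-quoted href only, first link only, no nested or malformed markup).

```python
import re

# Python translation of the pattern above; re.S lets "." span newlines,
# re.I ignores the case of the tag and attribute names.
LINK_RE = re.compile(r'<a\s+[^>]*?href="([^"]*)".*?>(.*?)</a>', re.I | re.S)
TAG_RE = re.compile(r"<[^>]*>")  # anything between angle brackets

def split_link(html):
    """Return (url, inner_html, plain_text) for the first link, or None."""
    match = LINK_RE.search(html)
    if match is None:
        return None
    url, inner = match.groups()
    return url, inner, TAG_RE.sub("", inner)

print(split_link('<a href="url">text</a>'))
print(split_link('<a href="url">text<span>something</span></a>'))
```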

Related

Parsing specific values in multiple pages

I have the following code, whose purpose is to parse specific information from each of several pages. The URLs of these pages follow a regular pattern, so I use that pattern to build all the links up front for further parsing.
import urllib
import urlparse
import re
from bs4 import BeautifulSoup
Links = ["http://www.newyorksocialdiary.com/party-pictures?page=" + str(i) for i in range(2,27)]
This gives me a list of URLs. I then read each page and make the soups.
Rs = [urllib.urlopen(Link).read() for Link in Links]
soups = [BeautifulSoup(R) for R in Rs]
These produce the soups I want, but I cannot achieve the final goal: parsing out a specific structure. For instance,
<a href="/party-pictures/2007/something-for-everyone">Something for Everyone</a>
I am specifically interested in obtaining things like this: '/party-pictures/2007/something-for-everyone'. However, the code below cannot serve this purpose.
As = [soup.find_all('a', attr = {"href"}) for soup in soups]
Could someone tell me where I went wrong? I highly appreciate your assistance. Thank you.
I am specifically interested in obtaining things like this: '/party-pictures/2007/something-for-everyone'.
The next would be going for regular expression!!
You don't necessarily need to use regular expressions, and, from what I understand, you can filter out the desired links with BeautifulSoup:
[[a["href"] for a in soup.select('a[href*=party-pictures]')]
    for soup in soups]
This, for example, would give you the list of links having party-pictures inside the href. *= means "contains", select() is a CSS selector search.
You can also use find_all() and apply the regular expression filter, for example:
pattern = re.compile(r"/party-pictures/2007/")
[[a["href"] for a in soup.find_all('a', href=pattern)]
    for soup in soups]
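Both filters can be tried on a small inline page instead of the downloaded soups. This is a sketch only: the markup below is a made-up stand-in, and every link in it other than the one quoted in the question is hypothetical.

```python
import re
from bs4 import BeautifulSoup

# Hypothetical stand-in for one downloaded page.
page = """
<a href="/party-pictures/2007/something-for-everyone">Something for Everyone</a>
<a href="/party-pictures/2008/another-night">Another Night</a>
<a href="/about">About</a>
"""
soup = BeautifulSoup(page, "html.parser")

# CSS attribute selector: href contains "party-pictures".
by_css = [a["href"] for a in soup.select('a[href*="party-pictures"]')]

# Regular-expression filter: href contains "/party-pictures/2007/".
pattern = re.compile(r"/party-pictures/2007/")
by_regex = [a["href"] for a in soup.find_all("a", href=pattern)]

print(by_css)
print(by_regex)
```

Quoting the value inside the selector avoids depending on how unquoted attribute values are parsed.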
This should work:
As = [soup.find_all(href=True) for soup in soups]
This should give you all tags that have an href attribute.
If you only want a tags that have an href, then the following would work:
As = [soup.find_all('a',href=True) for soup in soups]

Python: Reading a webpage and extracting text from that page

I'm writing in Python to try and get exchange rates from the website:
xe.com/currency/converter (I can't post another link, sorry - I'm at limit)
I want to be able to get rates from this file, for example, for the conversion between GBP and USD:
Therefore, I would search the url: "http://www.xe.com/currencyconverter/convert/?Amount=1&From=GBP&To=USD" , then get the value printed "1.56371 USD" (the rates at the time I was writing this message), and assign that value as an int to a variable, like rate_usd.
At the moment, I was thinking about using the BeautifulSoup module and urllib.request module, and request the url ("http://www.xe.com/currencyconverter/convert/?Amount=1&From=GBP&To=USD") and search through it using BeautifulSoup. At the moment, I'm at this stage in the coding:
import urllib.request
from bs4 import BeautifulSoup
def rates_fetcher(url):
    html = urllib.request.urlopen(url).read()
    soup = BeautifulSoup(html)
    # code to search through soup and fetch the converted value
    # e.g. 1.56371
    # How would I extract this value?
    # I have inspected the page element and found the value I want to be in the class:
    # <td width="47%" align="left" class="rightCol">1.56371
    # I'm thinking about searching through the class: class="rightCol"
    # and extracting the value that way, but how?
url1 = "http://www.xe.com/currencyconverter/convert/?Amount=1&From=GBP&To=USD"
rates_fetcher(url1)
Any help would be much appreciated, and thank you whoever took the time to read this.
p.s. Sorry in advance if I have made any typos, I'm kinda' in a hurry :s
It sounds like you've got the right idea.
def rates_fetcher(url):
    html = urllib.request.urlopen(url).read()
    soup = BeautifulSoup(html)
    return [item.text for item in soup.find_all(class_='rightCol')]
That should do it... This will return a list of the text inside any tag with the class 'rightCol'.
If you haven't read through the Beautiful Soup documentation, you really oughtta. It's straightforward and very useful.
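To go from the list of cell texts to a number, something like the sketch below might work. It runs on an inline snippet modelled on the cell quoted in the question, so the real page's markup may differ; parse_rate and sample are hypothetical names. Note that the rate is a float, not an int.

```python
from bs4 import BeautifulSoup

# Hypothetical snippet modelled on the cell quoted in the question.
sample = (
    '<table><tr>'
    '<td width="47%" align="left" class="rightCol">1.56371 USD</td>'
    '</tr></table>'
)

def parse_rate(html):
    """Return the leading number of the first rightCol cell as a float, or None."""
    soup = BeautifulSoup(html, "html.parser")
    cell = soup.find(class_="rightCol")
    if cell is None:
        return None
    # get_text(strip=True) trims surrounding whitespace; the number comes first.
    return float(cell.get_text(strip=True).split()[0])

rate_usd = parse_rate(sample)
print(rate_usd)
```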
Try pyquery. It's a lot better than Soup.
PS: For urllib, try Requests: Http for humans
PS2: Actually, in the end I use Node and jQuery (or something jQuery-like) for HTML scraping.

Using BeautifulSoup to search for a tag with certain attributes, but NOT fixed attribute values

I have the following HTML that I'm trying to search via beautifulsoup:
<a href="/user_details?userid=**********">
The problem is I want to pull out all links with the above format, but the userid can change. How would I go about doing that?
Thanks!
Would something like
for link in soup.find_all('a'):
    link_href = link.get('href', '')  # default to '' so <a> tags without an href don't raise TypeError
    if 'user_details' in link_href:
        # found match
        print link_href
do the trick?
We list all a (link) urls,
search for the string 'user_details' in each link,
if it's found - do work with the link.
If 'user_details' also matches links you don't want, make the search string more specific.
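A stricter variant is to hand find_all a compiled regular expression for the href, so that only links starting with the fixed prefix match. A sketch under Python 3 with bs4; the markup and userid values are invented for the example.

```python
import re
from bs4 import BeautifulSoup

# Hypothetical markup; the userid values are invented.
sample = """
<a href="/user_details?userid=abc123">Alice</a>
<a href="/user_details?userid=xyz789">Bob</a>
<a href="/about?ref=user_details">About</a>
<a name="no-href">Anchor</a>
"""
soup = BeautifulSoup(sample, "html.parser")

# Match only hrefs that start with the fixed prefix; the userid may be anything.
user_link = re.compile(r"^/user_details\?userid=")
matches = [a["href"] for a in soup.find_all("a", href=user_link)]
print(matches)
```

The third link contains the string user_details but is skipped because the pattern is anchored at the start of the href.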

Python 3 Beautiful Soup Data type incompatibility issue

Hello there stack community!
I'm having an issue that I can't seem to resolve since it looks like most of the help out there is for Python 2.7.
I want to pull a table from a webpage and then just get the linktext and not the whole anchor.
Here is the code:
from urllib.request import urlopen
from bs4 import BeautifulSoup
import re
url = 'http://www.craftcount.com/category.php?cat=5'
html = urlopen(url).read()
soup = BeautifulSoup(html)
alltables = soup.findAll("table")
## This bit captures the input from the previous sequence
results=[]
for link in alltables:
    rows = link.findAll('a')
## Find just the names
top100 = re.findall(r">(.*?)<\/a>",rows)
print(top100)
When I run it, I get: "TypeError: expected string or buffer"
Up to the second to the last line, it does everything correctly (when I swap out 'print(top100)' for 'print(rows)').
As an example of the response I get:
<a href="...">michellechangjewelry</a>
And I just need to get:
michellechangjewelry
According to pythex.org, my (ir)regular expression should work, so I wanted to see if anyone out there knew how to do that. As an additional issue, it looks like most people like to go the other way, that is, from having the full text and only wanting the URL part.
Finally, I'm using BeautifulSoup out of "convenience", but I'm not beholden to it if you can suggest a better package to narrow down the parsing to the linktext.
Many thanks in advance!!
BeautifulSoup results are not strings; they are Tag objects, mostly.
To get the text of the <a> tags, use the .string attribute:
for table in alltables:
    link = table.find('a')
    top100 = link.string
    print(top100)
This finds the first <a> link in a table. To find all text of all links:
for table in alltables:
    links = table.find_all('a')
    top100 = [link.string for link in links]
    print(top100)
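One caveat worth seeing in action: .string is None when a tag has more than one child, whereas get_text() always returns a string. The sketch below uses an invented table; only the first shop name comes from the question, and the href values are hypothetical.

```python
from bs4 import BeautifulSoup

# Hypothetical table; the second link has nested markup on purpose.
sample = """
<table>
  <tr><td><a href="/shop/1">michellechangjewelry</a></td></tr>
  <tr><td><a href="/shop/2"><b>bold</b>shop</a></td></tr>
</table>
"""
soup = BeautifulSoup(sample, "html.parser")

links = soup.find_all("a")
strings = [link.string for link in links]    # None when a tag has several children
texts = [link.get_text() for link in links]  # always a str, nested text included
print(strings)
print(texts)
```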

Python HTML scraping

It's not really scraping, I'm just trying to find the URLs in a web page where the class has a specific value. For example:
<a class="myClass" href="/url/7df028f508c4685ddf65987a0bd6f22e">
I want to get the href value. Any ideas on how to do this? Maybe regex? Could you post some example code?
I'm guessing html scraping libs, such as BeautifulSoup, are a bit of overkill just for this...
Huge thanks!
Regex is usually a bad idea; try using BeautifulSoup.
Quick example:
html = open("htmlfile").read()  # or however you obtain the HTML
soup = BeautifulSoup(html)
links = soup.findAll('a', attrs={'class': 'myClass'})  # class names are case-sensitive
for link in links:
    pass  # process link
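A runnable version of the same idea, assuming Python 3 and the bs4 package. Only the first link in the sample is taken from the question; the other two are hypothetical, added to show what gets skipped.

```python
from bs4 import BeautifulSoup

# Hypothetical page fragment.
sample = """
<a class="myClass" href="/url/7df028f508c4685ddf65987a0bd6f22e">one</a>
<a class="other" href="/url/ignored">two</a>
<a class="myClass">no href</a>
"""
soup = BeautifulSoup(sample, "html.parser")

# class_ matches the CSS class case-sensitively; href=True skips links without one.
hrefs = [a["href"] for a in soup.find_all("a", class_="myClass", href=True)]
print(hrefs)
```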
Aargh, not regex for parsing HTML!
Luckily in Python we have BeautifulSoup or lxml to do that job for us.
Regex would be a bad choice. HTML is not a regular language. How about Beautiful Soup?
Regex should not be used to parse HTML. See the first answer to this question for an explanation :)
+1 for BeautifulSoup.
If your task is just this simple, just use string manipulation (without even regex)
f=open("htmlfile")
for line in f:
    if "<a class" in line and "myClass" in line and "href" in line:
        s = line[line.index("href") + len('href="'):]
        print s[:s.index('">')]
f.close()
An HTML parser is not a must for such cases.
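If the concern is only the external dependency, a middle ground is the standard library's html.parser, which is a different technique from the line scan above but still needs no install. A sketch for Python 3; ClassLinkCollector is a hypothetical helper name. Unlike the line scan, it does not depend on attribute order or on the tag fitting on one line.

```python
from html.parser import HTMLParser

class ClassLinkCollector(HTMLParser):
    """Collect href values of <a> tags carrying a given class (hypothetical helper)."""

    def __init__(self, wanted_class):
        super().__init__()
        self.wanted_class = wanted_class
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        attr_map = dict(attrs)
        classes = (attr_map.get("class") or "").split()
        if self.wanted_class in classes and attr_map.get("href"):
            self.hrefs.append(attr_map["href"])

collector = ClassLinkCollector("myClass")
# href comes before class and the tag spans two lines.
collector.feed('<p><a href="/url/7df028f508c4685ddf65987a0bd6f22e"\n class="myClass">x</a></p>')
print(collector.hrefs)
```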
The thing is I know the structure of the HTML page, and I just want to find that specific kind of links (where class="myclass"). BeautifulSoup anyway?
read Parsing Html The Cthulhu Way https://blog.codinghorror.com/parsing-html-the-cthulhu-way/
