Web scraping the links from a table - Python

I want to scrape the links and their respective texts from a table, and I plan to use regex to accomplish this.
Say the page contains multiple anchors whose texts are text_1, text_2, and so on. I want to get all the text_i values into one list and all the hrefs into a separate list.
I have:
import re
import requests

web = requests.get(url)
web_text = web.text
texts = re.findall(r'<table .*><a .*>(.*)</a></table>', web_text)
The regex finds all the anchor tags, of whatever class, inside an HTML table of whatever class, and returns their texts, correct? It is taking an extraordinarily long time. Is this the correct way to do it?
Also, how do I go about getting the href URLs?

I suggest you use Beautiful Soup to parse the HTML text of the table.
Adapted from Beautiful Soup's documentation, you could, for example, do:
from bs4 import BeautifulSoup

soup = BeautifulSoup(web_text, 'html.parser')
for link in soup.find_all('a'):
    print(link.get('href'))
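Since the question asks for the link texts and the hrefs from a table specifically, here is a minimal sketch extending the snippet above; it assumes the first table element on the page is the one of interest:
from bs4 import BeautifulSoup

soup = BeautifulSoup(web_text, 'html.parser')
table = soup.find('table')  # assumption: the first table is the target
links = table.find_all('a')
texts = [a.get_text(strip=True) for a in links]  # the text_i values
hrefs = [a.get('href') for a in links]           # the corresponding URLs
Unlike the regex, this scans the document once with a real HTML parser, which is part of why a pattern with several greedy .* groups can feel so slow by comparison.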

Related

How to extract only a specific kind of link from a webpage with beautifulsoup4

I'm trying to extract specific links on a page full of links. The links I need contain the word "apartment" in them.
But whatever I try, I get way more data extracted than only the links I need.
<a href="https://www.website.com/en/ad/apartment/abcd123" title target="IWEB_MAIN">
If anyone could help me out on this, it'd be much appreciated!
Also, if you have a good source that could inform me better about this, it would be double appreciated!
You can use the regular expression module re:
import re

soup = BeautifulSoup(Pagesource, 'html.parser')
alltags = soup.find_all("a", attrs={"href": re.compile("apartment")})
for item in alltags:
    print(item['href'])  # grab the href value
Or you can use a CSS selector:
soup = BeautifulSoup(Pagesource, 'html.parser')
alltags = soup.select("a[href*='apartment']")
for item in alltags:
    print(item['href'])
You can find the details in the official Beautiful Soup documentation.
Edited:
You need to select the parent div first, then find the anchor tag.
import requests
from bs4 import BeautifulSoup

res = requests.get("https://www.immoweb.be/en/search/apartment/for-sale/leuven/3000")
soup = BeautifulSoup(res.text, 'html.parser')
for item in soup.select("div[data-type='resultgallery-resultitem'] > a[href*='apartment']"):
    print(item['href'])

Equivalent regular expression to extract link using Beautiful Soup

I am randomly exploring web scraping with Python. I have the link of a Google search results page and used urllib to fetch it. From that page I am extracting all possible anchor tags with the help of the Beautiful Soup library, so now I have lots of links. Among those I want to pick only the links that match my required pattern.
For example, this is one of the many links that got parsed, and I want to narrow the results down to links that look like it:
/url?q=http://avadl.uploadt.com/DL4/Film/&sa=U&ved=0ahUKEwiYwOKe1r7hAhWUf30KHcHUBkMQFggUMAA&usg=AOvVaw39cIJ0T8_CAQMY8EkSWZJl
And from each such link I need to extract only this part:
http://avadl.uploadt.com/DL4/Film/
I tried these:
possible_websites.append(re.findall('/url?q=(\S+)',links))
possible_websites.append(re.findall('/url?q=(\S+^&)',links))
Here's my code:
soup = BeautifulSoup(webpage, 'html.parser')
tags = soup('a')
possible_websites = []
for tag in tags:
    links = tag.get('href', None)
    possible_websites.append(re.findall('/url?q=(\S+)', links))
I want to use a regular expression to extract the required text part. I am using the Beautiful Soup module to extract the HTML data. In short, this is mostly a regular expression problem.
It’s not regex, but I’d use urllib:
from urllib.parse import parse_qs, urlparse
url = urlparse('/url?q=http://avadl.uploadt.com/DL4/Film/&sa=U&ved=0ahUKEwiYwOKe1r7hAhWUf30KHcHUBkMQFggUMAA&usg=AOvVaw39cIJ0T8_CAQMY8EkSWZJl')
qs = parse_qs(url.query)
print(qs['q'][0])
If you really need a regex, use q=(.*/)&; otherwise go with Ry-'s answer, i.e.:
import re

u = "/url?q=http://avadl.uploadt.com/DL4/Film/&sa=U&ved=0ahUKEwiYwOKe1r7hAhWUf30KHcHUBkMQFggUMAA&usg=AOvVaw39cIJ0T8_CAQMY8EkSWZJl"
m = re.findall("q=(.*/)&", u)
if m:
    print(m[0])
    # http://avadl.uploadt.com/DL4/Film/
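Putting the pieces together, here is a minimal end-to-end sketch; it assumes webpage holds the fetched Google results HTML, as in the question's code:
from urllib.parse import parse_qs, urlparse
from bs4 import BeautifulSoup

soup = BeautifulSoup(webpage, 'html.parser')
possible_websites = []
for tag in soup('a'):
    href = tag.get('href')
    # Google result links look like /url?q=<target>&sa=...; parse the query string
    if href and href.startswith('/url?'):
        qs = parse_qs(urlparse(href).query)
        if 'q' in qs:
            possible_websites.append(qs['q'][0])
This sidesteps regex escaping issues entirely; note that the ? in /url?q= is a regex metacharacter, which is one reason the original patterns misbehaved.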

What is the most efficient way to get a specific link using Beautiful Soup in Python 3.0?

I am currently taking the Python specialization on Coursera, and I have come across the issue of extracting a specific link from a webpage using BeautifulSoup. From this webpage (http://py4e-data.dr-chuck.net/known_by_Fikret.html), I am supposed to extract the URL at a position given by user input, open that link, and repeat the process for some number of iterations, with all links identified through the anchor tag.
While I am able to program this using lists, I am wondering if there is any simpler way of doing it without using lists or dictionaries.
html = urllib.request.urlopen(url, context=ctx).read()
soup = BeautifulSoup(html, 'html.parser')
tags = soup('a')
nameList = list()
loc = ''
count = 0
for tag in tags:
    loc = tag.get('href', None)
    nameList.append(loc)
url = nameList[pos-1]
In the above code, you will notice that after locating the links using the 'a' tag and 'href', I can't help but create a list called nameList to locate the link at a given position. As this is inefficient, I would like to know if I could locate the URL directly without using a list. Thanks in advance!
The easiest way is to get an element out of the tags list and then extract the href value:
tags = soup('a')
a = tags[pos-1]
loc = a.get('href', None)
You can also use the soup.select_one() method to query the :nth-of-type element:
soup.select_one('a:nth-of-type({})'.format(pos))
As :nth-of-type uses 1-based indexing, you don't need to subtract 1 from the pos value if your users are expected to use 1-based indexing too.
Note that soup's :nth-of-type is not equivalent to CSS :nth-of-type pseudo-class, as it always selects only one element, while CSS selector may select many elements at once.
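For example, a minimal usage sketch, assuming pos is the 1-based position taken from user input:
a = soup.select_one('a:nth-of-type({})'.format(pos))
loc = a.get('href', None) if a is not None else None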
And if you're looking for "the most efficient way", then you need to look at lxml:
from lxml.html import fromstring

tree = fromstring(r.content)  # r is assumed to be the requests response for the page
url = tree.xpath('(//a)[{}]/@href'.format(pos))[0]

Parsing specific values in multiple pages

I have the following code, whose purpose is to parse specific information from each of multiple pages. The URL of each page is structured, and I use this structure to collect all the links at once for further parsing.
import urllib
import urlparse
import re
from bs4 import BeautifulSoup
Links = ["http://www.newyorksocialdiary.com/party-pictures?page=" + str(i) for i in range(2,27)]
This gives me a list of links. I go further to read each page in and make soups:
Rs = [urllib.urlopen(Link).read() for Link in Links]
soups = [BeautifulSoup(R) for R in Rs]
While these make the soups that I desire, I cannot achieve the final goal of parsing out the hrefs. For instance, the page contains links like:
<a href="/party-pictures/2007/something-for-everyone">Something for Everyone</a>
I am specifically interested in obtaining things like this: '/party-pictures/2007/something-for-everyone'. However, the code below cannot serve this purpose.
As = [soup.find_all('a', attr = {"href"}) for soup in soups]
Could someone tell me where this went wrong? I highly appreciate your assistance. Thank you.
I am specifically interested in obtaining things like this: '/party-pictures/2007/something-for-everyone'.
The next step would be going for regular expressions!
You don't necessarily need to use regular expressions, and, from what I understand, you can filter out the desired links with BeautifulSoup:
[[a["href"] for a in soup.select('a[href*=party-pictures]')]
for soup in soups]
This, for example, would give you the list of links having party-pictures inside the href. *= means "contains", select() is a CSS selector search.
You can also use find_all() and apply the regular expression filter, for example:
pattern = re.compile(r"/party-pictures/2007/")
[[a["href"] for a in soup.find_all('a', href=pattern)]
for soup in soups]
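If a single flat list across all pages is more convenient than one list per soup, the same idea can be flattened; the \d{4} here is an assumption, matching any year rather than just 2007:
pattern = re.compile(r"/party-pictures/\d{4}/")
all_links = [a["href"]
             for soup in soups
             for a in soup.find_all('a', href=pattern)]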
This should work:
As = [soup.find_all(href=True) for soup in soups]
This should give you all tags that have an href attribute.
If you only want hrefs on 'a' tags, then the following would work:
As = [soup.find_all('a', href=True) for soup in soups]

How does table parsing work in python? Is there an easy way other than beautiful soup?

I am trying to understand how one can use Beautiful Soup to extract the href links under a particular column in a table on a webpage. For example, consider the link: http://en.wikipedia.org/wiki/List_of_Telugu_films_of_2015.
On this page, the table with class wikitable has a title column. I need to extract the href link behind each of the values under that column and put them in an Excel sheet. What would be the best way to do this? I am having a little difficulty understanding the Beautiful Soup table parsing documentation.
You don't really have to literally navigate the tree; you can simply try to see what identifies those lines.
In this example, the URLs you are looking for reside in a table with class="wikitable", and within that table they reside in td tags with align=center. Now we have a somewhat unique identification for our links and can start extracting them.
However, keep in mind that multiple tables with class="wikitable" and td tags with align=center may exist; if you want only the first or second table, that depends on your choice, and you will have to add extra filters.
The code should look something like this for extracting all links from those tables:
import urllib2
from bs4 import BeautifulSoup, SoupStrainer

content = urllib2.urlopen("http://en.wikipedia.org/wiki/List_of_Telugu_films_of_2015").read()
filter_tag = SoupStrainer("table", {"class": "wikitable"})
soup = BeautifulSoup(content, parse_only=filter_tag)
links = []
for sp in soup.find_all(align="center"):
    a_tag = sp('a')
    if a_tag:
        links.append(a_tag[0].get('href'))
One more thing to note here is the use of SoupStrainer. It specifies a filter for the content you want to process, which helps speed up the parse. Try removing the parse_only argument from this line:
soup = BeautifulSoup(content, parse_only=filter_tag)
and notice the difference. (I noticed it because my PC is not that powerful.)
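If you want to quantify the difference rather than eyeball it, here is a minimal timing sketch using the standard time module; content is assumed to be the HTML fetched above:
import time
from bs4 import BeautifulSoup, SoupStrainer

filter_tag = SoupStrainer("table", {"class": "wikitable"})

start = time.time()
BeautifulSoup(content, parse_only=filter_tag)  # parses only the wikitable
print("with SoupStrainer: %.3fs" % (time.time() - start))

start = time.time()
BeautifulSoup(content)  # parses the whole document
print("without SoupStrainer: %.3fs" % (time.time() - start))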
