Python Beautiful Soup (Select only the first element of a class when two share the same name)

I am using Beautiful Soup to parse the elements of an email, and I have successfully been able to extract the links from a button in the email. However, the class name on the button appears twice in the email HTML, so two links are extracted and printed. I only need the first link, i.e. the first reference to that class.
This is my code:
from bs4 import BeautifulSoup

soup = BeautifulSoup(msg.html, 'html.parser')
for link in soup('a', class_='mcnButton', href=True):
    print(link['href'])
The 'mcnButton' class matches two HTML buttons within the email, containing two separate links. I only need the first 'mcnButton' and the link it contains.
The above code prints both links (again, I only need the first):
https://leemarpet.us10.list-manage.com/track/XXXXXXX1
https://leemarpet.us10.list-manage.com/track/XXXXXXX2
I figured there should be a way to index and separately access the first reference to the class and its link. Any assistance would be greatly appreciated. Thanks!
I tried select_one, find, and attempts to index the class, but unfortunately these ended in syntax errors.

To find only the first element matching your pattern, use .find():
soup.find('a', class_='mcnButton', href=True)
and to get its href:
soup.find('a', class_='mcnButton', href=True).get('href')
For more information, check the docs.
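One caveat worth knowing: .find() returns None when nothing matches, so the chained ['href'] access will raise an error on a page without the button. A minimal sketch of a guarded version, reusing msg.html from the question:
from bs4 import BeautifulSoup

soup = BeautifulSoup(msg.html, 'html.parser')
button = soup.find('a', class_='mcnButton', href=True)
# .find() returns None when no match exists, so check before indexing
if button is not None:
    print(button['href'])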

Related

How to take hyperlinks from a wikipedia page

My current project requires obtaining the summaries of some Wikipedia pages. This is really easy to do, but I want to make a general script for it. More specifically, I also want to obtain the summaries of hyperlinks. For example, I want to get the summary of this page: https://en.wikipedia.org/wiki/Creative_industries (this is easy). Moreover, I would also like to get the summaries of the hyperlinks in Section: Definitions -> 'Advertising', 'Marketing', 'Architecture', ..., 'visual arts'. My problem is that some of these hyperlinks have different page names. For example, the previously mentioned page has the hyperlink 'Software' (number 6), but I want the summary of its target page, which is 'Software Engineering'.
Can someone help me with that? I can find the summaries of the pages whose names match the hyperlink text, but that is not always the case. So basically I am looking for a way to use (page.links) on only one area of the page.
Thank you in advance
Try using Beautiful Soup; this will print all the links with the given prefix.
from bs4 import BeautifulSoup
import requests, re

# Don't forget to install the 'lxml' parser package
url = "your link"
response = requests.get(url)
data = response.text
soup = BeautifulSoup(data, 'lxml')
tags = soup.find_all('a')

# This will print every available link
for tag in tags:
    print(tag.get('href'))

# This will print only links with the given prefix
for link in soup.find_all('a', attrs={'href': re.compile("^{{your prefix here}}")}):
    print(link.get('href'))
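For the section-specific part of the question (links under Definitions only), one approach is to locate the section heading and walk its following siblings until the next heading. A sketch, assuming Wikipedia's classic markup where the heading (or a span inside it) carries id="Definitions" and the section body follows the heading as siblings:
from bs4 import BeautifulSoup
import requests

url = "https://en.wikipedia.org/wiki/Creative_industries"
soup = BeautifulSoup(requests.get(url).text, 'lxml')

# The id may sit on the heading itself or on a span inside it
anchor = soup.find(id="Definitions")
heading = anchor if anchor.name in ("h2", "h3") else anchor.find_parent(["h2", "h3"])

# Collect links from every sibling element until the next section heading
section_links = []
for sibling in heading.find_next_siblings():
    if sibling.name in ("h2", "h3"):
        break
    section_links.extend(a["href"] for a in sibling.find_all("a", href=True))
print(section_links)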

How to distinguish two elements with the same class name

I've tried to target a.nostyle in my code; however, when I do so, it will sometimes grab the email above, as they share the same tags. I can't seem to find any tags unique to the phone number. How would you go about doing so? Any help would be greatly appreciated.
You can try
a.nostyle:not([itemprop])
UPDATE
As older versions of BeautifulSoup don't support the :not() syntax (the soupsieve backend added full CSS selector support in 4.7.0), you can try a workaround:
link = [link for link in soup.select('a.nostyle') if 'itemprop' not in link.attrs][0]
to select the link with the required class that doesn't carry an itemprop attribute (which the email link does).
You can make a list which contains all the a tags, then target the required tag by its index number.
Example
allATagContainer = soup.findAll("a")
Then you can use allATagContainer[1] to target the second a tag.
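A self-contained sketch of the first workaround, using hypothetical markup that mirrors the question (two a.nostyle links, where only the email carries itemprop):
from bs4 import BeautifulSoup

# Hypothetical markup: only the email link has an itemprop attribute
html = '''
<a class="nostyle" itemprop="email" href="mailto:someone@example.com">someone@example.com</a>
<a class="nostyle" href="tel:+15551234567">555-123-4567</a>
'''
soup = BeautifulSoup(html, 'html.parser')

# Keep only the a.nostyle links without itemprop, i.e. the phone number
phone = [a for a in soup.select('a.nostyle') if 'itemprop' not in a.attrs][0]
print(phone['href'])  # tel:+15551234567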

Scraping pages with multiple parts with Python

I want to scrape this site for a complete list of the teammates. I know how to do that with BeautifulSoup for the first page, but the results are split across a lot of pages. Is there a way to scrape all of the parts?
Thank you!
https://www.transfermarkt.co.uk/yvon-mvogo/profil/spieler/147051
https://www.transfermarkt.co.uk/steve-von-bergen/profil/spieler/4793
https://www.transfermarkt.co.uk/scott-sutter/profil/spieler/34520
Above are some links to the player profiles. You can open the page in BeautifulSoup and parse it to get all the links in it. Then write a regular expression to filter out only the links that match the above pattern, and another function to extract information from the profile pages.
import re
from bs4 import BeautifulSoup

soup = BeautifulSoup(html_page, 'html.parser')
for a in soup.find_all('a', href=True):
    m = re.search(r'/[a-z\-]+/profil/spieler/[0-9]+', a['href'])
    if m:
        found = m.group(0)
        print(found)
Output
/michael-frey/profil/spieler/147043
/yvon-mvogo/profil/spieler/147051
/scott-sutter/profil/spieler/34520
/leonardo-bertone/profil/spieler/194975
/steve-von-bergen/profil/spieler/4793
/alain-nef/profil/spieler/4945
/raphael-nuzzolo/profil/spieler/32574
/marco-wolfli/profil/spieler/4860
/moreno-costanzo/profil/spieler/41207
/jan-lecjaks/profil/spieler/62854
/alain-rochat/profil/spieler/4843
/christoph-spycher/profil/spieler/2871
/gonzalo-zarate/profil/spieler/52731
/christian-schneuwly/profil/spieler/52556
/yuya-kubo/profil/spieler/186260
/alexander-farnerud/profil/spieler/10255
/salim-khelifi/profil/spieler/147049
/alexander-gerndt/profil/spieler/45881
/adrian-winter/profil/spieler/59681
/victor-palsson/profil/spieler/97241
/milan-gajic/profil/spieler/46928
/dusan-veskovac/profil/spieler/28705
/marco-burki/profil/spieler/172192
/elsad-zverotic/profil/spieler/25542
/pa-modou/profil/spieler/66449
/yoric-ravet/profil/spieler/82461
You can loop through all the links and call a function that extracts the information you require from the profile pages. Hope this helps.
Use this link; I got it from inspecting the pagination buttons:
https://www.transfermarkt.co.uk/michael-frey/gemeinsameSpiele/spieler/147043/ajax/yw2/page/1
You can change the number at the end to get each page.
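A sketch of how that endpoint could be walked page by page until it runs out of rows; the row selector and the browser-like User-Agent header are assumptions about the site, not confirmed details:
import requests
from bs4 import BeautifulSoup

base = "https://www.transfermarkt.co.uk/michael-frey/gemeinsameSpiele/spieler/147043/ajax/yw2/page/{}"
headers = {"User-Agent": "Mozilla/5.0"}  # assumption: default agents may be refused

page = 1
while True:
    response = requests.get(base.format(page), headers=headers)
    soup = BeautifulSoup(response.text, 'html.parser')
    rows = soup.select('table.items tbody tr')  # assumed markup of the results table
    if not rows:
        break
    for row in rows:
        print(row.get_text(" ", strip=True))
    page += 1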

What is the most efficient way to get a specific link using Beautiful Soup in Python 3.0?

I am currently taking the Python specialization on Coursera. I have come across the issue of extracting a specific link from a webpage using BeautifulSoup. From this webpage (http://py4e-data.dr-chuck.net/known_by_Fikret.html), I am supposed to extract a URL at a position given by user input, open that subsequent link (all identified through the anchor tag), and repeat for some number of iterations.
While I was able to program this using lists, I am wondering if there is any simpler way of doing it without using lists or dictionaries?
import urllib.request
from bs4 import BeautifulSoup

# url, ctx (ssl context), and pos come from earlier course boilerplate
html = urllib.request.urlopen(url, context=ctx).read()
soup = BeautifulSoup(html, 'html.parser')
tags = soup('a')
nameList = list()
loc = ''
count = 0
for tag in tags:
    loc = tag.get('href', None)
    nameList.append(loc)
url = nameList[pos-1]
In the above code, you will notice that after locating the links using the 'a' tag and 'href', I can't help but create a list called nameList just to locate the position of the link. As this is inefficient, I would like to know if I could directly locate the URL without using lists. Thanks in advance!
The easiest way is to get an element out of the tags list and then extract the href value:
tags = soup('a')
a = tags[pos-1]
loc = a.get('href', None)
You can also use the soup.select_one() method to query the :nth-of-type element:
soup.select_one('a:nth-of-type({})'.format(pos))
As :nth-of-type uses 1-based indexing, you don't need to subtract 1 from the pos value if your users are expected to use 1-based indexing too.
Note that soup's :nth-of-type is not equivalent to CSS :nth-of-type pseudo-class, as it always selects only one element, while CSS selector may select many elements at once.
And if you're looking for "the most efficient way", then you need to look at lxml:
from lxml.html import fromstring

# r is the HTTP response for the page (e.g. from requests)
tree = fromstring(r.content)
url = tree.xpath('(//a)[{}]/@href'.format(pos))[0]
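Putting it together, a minimal sketch of the assignment's repeat-and-follow loop without any intermediate list; pos and count here are example inputs, not values from the original code:
import urllib.request
from bs4 import BeautifulSoup

url = 'http://py4e-data.dr-chuck.net/known_by_Fikret.html'
pos, count = 18, 7  # example inputs

for _ in range(count):
    html = urllib.request.urlopen(url).read()
    soup = BeautifulSoup(html, 'html.parser')
    url = soup('a')[pos - 1].get('href')  # index the pos-th anchor directly
    print(url)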

How does table parsing work in Python? Is there an easy way other than Beautiful Soup?

I am trying to understand how one can use Beautiful Soup to extract the href links under a particular column in a table on a webpage. For example, consider the link: http://en.wikipedia.org/wiki/List_of_Telugu_films_of_2015.
On this page the table with class wikitable has a column Title; I need to extract the href links behind each of the values under that column and put them in an Excel sheet. What would be the best way to do this? I am having a little difficulty understanding the Beautiful Soup table-parsing documentation.
You don't really have to literally navigate the tree; you can simply try to see what identifies those lines.
As in this example, the URLs you are looking for reside in a table with class="wikitable", and within that table they reside in td tags with align=center. Now we have a somewhat unique identification for our links, and we can start extracting them.
However, you should take into consideration that multiple tables with class="wikitable" and td tags with align=center may exist. If you want only the first or second table, you will have to add extra filters.
The code should look something like this for extracting all links from those tables:
import urllib.request
from bs4 import BeautifulSoup, SoupStrainer

content = urllib.request.urlopen("http://en.wikipedia.org/wiki/List_of_Telugu_films_of_2015").read()
filter_tag = SoupStrainer("table", {"class": "wikitable"})
soup = BeautifulSoup(content, 'html.parser', parse_only=filter_tag)
links = []
for sp in soup.find_all(align="center"):
    a_tag = sp('a')
    if a_tag:
        links.append(a_tag[0].get('href'))
There's one more thing to note here: the use of SoupStrainer. It specifies a filter for the content you want to process, which helps speed things up. Try removing the parse_only argument on this line:
soup = BeautifulSoup(content, 'html.parser', parse_only=filter_tag)
and notice the difference. (I noticed it because my PC is not that powerful.)
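To scope the extraction to just the Title column rather than every centered cell, one option is to read the column's index from the header row first. A sketch, with the caveat that rowspan/colspan cells in wikitables can shift column positions:
# Reuses `soup` parsed from the same page as above
table = soup.find("table", class_="wikitable")
header_cells = [th.get_text(strip=True) for th in table.find("tr").find_all("th")]
title_idx = header_cells.index("Title")  # assumes the column is literally named "Title"

title_links = []
for row in table.find_all("tr")[1:]:
    cells = row.find_all(["td", "th"])
    if len(cells) > title_idx:
        a = cells[title_idx].find("a", href=True)
        if a:
            title_links.append(a["href"])
print(title_links)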
