Scrape all href into list with BeautifulSoup

I'd like to grab the links from this page and put them in a list.
I have this code:
import bs4 as bs
import urllib.request
source = urllib.request.urlopen('http://www.gcoins.net/en/catalog/236').read()
soup = bs.BeautifulSoup(source,'lxml')
links = soup.find_all('a', attrs={'class': 'view'})
print(links)
It produces the following output:
[<a class="view" href="/en/catalog/view/514">
<img alt="View details" height="32" src="/img/actions/file.png" title="View details" width="32"/>
</a>,
"""There are 28 lines more"""
<a class="view" href="/en/catalog/view/565">
<img alt="View details" height="32" src="/img/actions/file.png" title="View details" width="32"/>
</a>]
I need to get the following: ['/en/catalog/view/514', ... , '/en/catalog/view/565']
But when I add href_value = links.get('href'), I get an error.

Try:
soup = bs.BeautifulSoup(source,'lxml')
links = [i.get("href") for i in soup.find_all('a', attrs={'class': 'view'})]
print(links)
Output:
['/en/catalog/view/514', '/en/catalog/view/515', '/en/catalog/view/179080', '/en/catalog/view/45518', '/en/catalog/view/521', '/en/catalog/view/111429', '/en/catalog/view/522', '/en/catalog/view/182223', '/en/catalog/view/168153', '/en/catalog/view/523', '/en/catalog/view/524', '/en/catalog/view/60228', '/en/catalog/view/525', '/en/catalog/view/539', '/en/catalog/view/540', '/en/catalog/view/31642', '/en/catalog/view/553', '/en/catalog/view/558', '/en/catalog/view/559', '/en/catalog/view/77672', '/en/catalog/view/560', '/en/catalog/view/55377', '/en/catalog/view/55379', '/en/catalog/view/32001', '/en/catalog/view/561', '/en/catalog/view/562', '/en/catalog/view/72185', '/en/catalog/view/563', '/en/catalog/view/564', '/en/catalog/view/565']

Your links is currently a Python list of tags, not a single tag. Loop over that list and read the href from each tag, as below.
final_hrefs = []
for each_link in links:
    final_hrefs.append(each_link['href'])
or as a one-liner:
final_hrefs = [each_link['href'] for each_link in links]
print(final_hrefs)
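To make the difference between the list and its elements concrete, here is a minimal offline sketch. It assumes a two-link excerpt of the page's markup pasted in as a string (the real page has 30 such links) and uses the built-in html.parser instead of lxml so it runs without extra dependencies.

```python
from bs4 import BeautifulSoup

# Assumed excerpt of the page's markup, inlined so the example runs offline.
html = """
<a class="view" href="/en/catalog/view/514"><img alt="View details"/></a>
<a class="view" href="/en/catalog/view/565"><img alt="View details"/></a>
"""
soup = BeautifulSoup(html, "html.parser")
links = soup.find_all("a", attrs={"class": "view"})

# find_all returns a list-like ResultSet, so calling .get on it fails.
try:
    links.get("href")
    failed = False
except AttributeError:
    failed = True

# .get must be called on each Tag inside the list instead.
hrefs = [link.get("href") for link in links]
print(failed, hrefs)
```

Using .get("href") rather than ["href"] returns None for a tag without that attribute instead of raising KeyError, which can be convenient on messier pages.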

Try the code below. You get the list of hrefs in one step, and the loop then prints each one as a full URL:
import bs4 as bs
import urllib.request
source = urllib.request.urlopen('http://www.gcoins.net/en/catalog/236').read()
soup = bs.BeautifulSoup(source,'lxml')
links = [i.get("href") for i in soup.find_all('a', attrs={'class': 'view'})]
for link in links:
    print('http://www.gcoins.net' + link)
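As an alternative to string concatenation, the standard library's urljoin can resolve relative hrefs against the page URL. This is only a sketch: the two relative links are taken from the output above, and base_url is the page address from the question.

```python
from urllib.parse import urljoin

base_url = "http://www.gcoins.net/en/catalog/236"
relative_links = ["/en/catalog/view/514", "/en/catalog/view/565"]

# urljoin resolves each href against the page it was found on,
# so it also copes with hrefs that are already absolute.
absolute_links = [urljoin(base_url, href) for href in relative_links]
for link in absolute_links:
    print(link)
```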

Related

How can I narrow down my soup.find search to one line in the HTML

I'm trying to scrape one line of html from this:
<strong class="listingPrice">
£75- £85
<abbr title="">pw</abbr>
</strong>
The line I'm trying to scrape is "£75- £85"
My current code to scrape the page is:
html_text = requests.get("web address").text
soup = BeautifulSoup(html_text, 'lxml')
prices = soup.find_all('strong', class_='listingPrice')
Any advice?

You can use .contents[0]:
from bs4 import BeautifulSoup
doc = """<strong class="listingPrice">
£75- £85
<abbr title="">pw</abbr>
</strong>"""
soup = BeautifulSoup(doc, "lxml")
price = soup.select_one(".listingPrice").contents[0].strip()
print(price)
Prints:
£75- £85
Or .find_next() with text=True (the argument is named string in Beautiful Soup 4.4.0 and later):
price = soup.select_one(".listingPrice").find_next(text=True).strip()
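If the page contains several listings, the same idea can be applied to every match from find_all, as in the question's code. A minimal sketch, assuming a second listing with a made-up price and using html.parser so it runs without lxml:

```python
from bs4 import BeautifulSoup

# The second <strong> block is a hypothetical extra listing for illustration.
doc = """
<strong class="listingPrice">
£75- £85
<abbr title="">pw</abbr>
</strong>
<strong class="listingPrice">
£90- £110
<abbr title="">pw</abbr>
</strong>
"""
soup = BeautifulSoup(doc, "html.parser")

# contents[0] is the text node that precedes the <abbr> child.
prices = [
    tag.contents[0].strip()
    for tag in soup.find_all("strong", class_="listingPrice")
]
print(prices)
```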

How to take a link from the onclick value in BeautifulSoup?

I need help scraping a link to an image that is stored in the onclick attribute.
I got this far, but I'm stuck on how to remove everything in onclick except the link.
<a onclick="ShowEnlargedImagePreview( 'https://steamuserimages-a.akamaihd.net/ugc/794261971268711656/69C39CF2A2BBCDDC7C04C17DF1E88A6ED875DBE7/' );"></a>
links = soup.find('div', class_='workshopItemPreviewImageMain')
links = links.findChild('a', attrs={'onclick': re.compile("^https://")})
But nothing is output.
links = soup.find('div', class_='workshopItemPreviewImageMain')
links = links.findChild('a')
links = links.get("onclick")
The entire value of onclick is displayed:
ShowEnlargedImagePreview( 'https://steamuserimages-a.akamaihd.net/ugc/794261971268711656/69C39CF2A2BBCDDC7C04C17DF1E88A6ED875DBE7/' );
But only a link is needed.

You just need to change your regular expression.
from bs4 import BeautifulSoup
import re
pattern = re.compile(r'''(?P<quote>['"])(?P<href>https?://.+?)(?P=quote)''')
data = '''
<div class="workshopItemPreviewImageMain">
<a onclick="ShowEnlargedImagePreview( 'https://steamuserimages-a.akamaihd.net/ugc/794261971268711656/69C39CF2A2BBCDDC7C04C17DF1E88A6ED875DBE7/' );"></a>
</div>
'''
soup = BeautifulSoup(data, 'html.parser')
div = soup.find('div', class_='workshopItemPreviewImageMain')
links = div.find_all('a', {'onclick': pattern})
for a in links:
    print(pattern.search(a['onclick']).group('href'))
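The pattern above can be tried on its own, without Beautiful Soup, to see what each part does. In this sketch the example.com URLs are placeholders, not real image links.

```python
import re

# (?P<quote>['"])  captures the opening quote character
# (?P<href>...)    lazily captures the URL
# (?P=quote)       requires the same quote character to close the string
pattern = re.compile(r'''(?P<quote>['"])(?P<href>https?://.+?)(?P=quote)''')

# Placeholder onclick values, one with single and one with double quotes.
single = "ShowEnlargedImagePreview( 'https://example.com/img/1/' );"
double = 'ShowEnlargedImagePreview( "https://example.com/img/2/" );'

urls = []
for onclick in (single, double):
    match = pattern.search(onclick)
    if match:  # search returns None when nothing matches
        urls.append(match.group("href"))
print(urls)
```

The question's original attempt used ^https://, which anchors the match at the start of the attribute value. Because the value starts with ShowEnlargedImagePreview, that pattern never matches.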

how to extract a href content from a website using BeautifulSoup package in python

I have the following example
<h2 class="m0 t-regular">
<a data-js-aid="jobID" data-js-link="" href="/en/qatar/jobs/executive-chef-4276199/" data-job-id="4276199">
Executive Chef </a>
</h2>
How do I find the "a" tag?
So far it returns an empty result:
import time
import requests
from bs4 import BeautifulSoup
soup = BeautifulSoup(
    requests.get("https://www.bayt.com/en/international/jobs/executive-chef-jobs/").content,
    "lxml"
)
follow_links = [
    a["href"] for a in
    soup.find_all("h2", class_="m0 t-regular")
    if "#" not in a["href"]
]
print(follow_links)
Result:
[]
How can I return the link?

You are close. Select the a tags inside the h2 elements and use ['href'] to get the URL.
Example
import requests
from bs4 import BeautifulSoup
soup = BeautifulSoup(
    requests.get("https://www.bayt.com/en/international/jobs/executive-chef-jobs/").content,
    "lxml"
)
links = []
for a in soup.select("h2.m0.t-regular a"):
    # skip hrefs that were already collected
    if a['href'] not in links:
        links.append(a['href'])
print(links)
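The selector approach can be checked offline against a small inline document. This is a sketch under assumptions: the markup is modelled on the question's snippet, the duplicated entry and the "sous-chef" href are invented for illustration, and html.parser is used so lxml is not required.

```python
from bs4 import BeautifulSoup

# Inline markup modelled on the question; the last href is hypothetical.
html = """
<h2 class="m0 t-regular"><a href="/en/qatar/jobs/executive-chef-4276199/">Executive Chef </a></h2>
<h2 class="m0 t-regular"><a href="/en/qatar/jobs/executive-chef-4276199/">Executive Chef </a></h2>
<h2 class="m0 t-regular"><a href="/en/uae/jobs/sous-chef-0000000/">Sous Chef</a></h2>
"""
soup = BeautifulSoup(html, "html.parser")

# a[href] only matches anchors that have an href attribute;
# dict.fromkeys drops duplicates while keeping first-seen order.
links = list(dict.fromkeys(
    a["href"] for a in soup.select("h2.m0.t-regular a[href]")
))
print(links)
```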

To get the href links, you need this code:
follow_links = [p.a["href"] for p in soup.find_all("h2", class_="m0 t-regular") if "#" not in p.a["href"]]
Prepend "https://www.bayt.com" (without a trailing slash, since the href already starts with one) if you want full URLs rather than just href="/en/qatar/jobs/executive-chef-4276199/":
follow_links = ["https://www.bayt.com" + p.a["href"] for p in soup.find_all("h2", class_="m0 t-regular") if "#" not in p.a["href"]]

Each iteration over soup.find_all("h2", class_="m0 t-regular") gives you the whole h2 element:
<h2 class="m0 t-regular">
<a data-job-id="4276199" data-js-aid="jobID" data-js-link="" href="/en/qatar/jobs/executive-chef-4276199/">
Executive Chef </a>
</h2>
So you need to take the a tag inside it and then read its 'href' attribute.
import requests
from bs4 import BeautifulSoup
soup = BeautifulSoup(
    requests.get("https://www.bayt.com/en/international/jobs/executive-chef-jobs/").content, "lxml")
# my solution
links = soup.select('h2.m0.t-regular')
for link in links:
    print(link.a['href'])
print(soup.find_all("h2", class_="m0 t-regular")[0])
follow_links = [
    tag_a.a["href"] for tag_a in
    soup.find_all("h2", class_="m0 t-regular")
    if "#" not in tag_a.a["href"]
]
print(follow_links)

Try this to get your hrefs:
h2_tags = soup.find_all("h2", class_="m0 t-regular")
follow_links = []
for h2 in h2_tags:  # then process each h2 with something like:
    if "#" not in h2.a["href"]:
        follow_links.append(h2.a["href"])

Your code extracts the h2 tags themselves. You need the tag that follows each h2, i.e. the a tag, and from that you can read the href:
import time
import requests
from bs4 import BeautifulSoup
soup = BeautifulSoup(
    requests.get("https://www.bayt.com/en/international/jobs/executive-chef-jobs/").content,
    "lxml"
)
follow_links = [a.find_next('a')['href'] for a in soup.find_all("h2", class_="m0 t-regular")]
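One caveat with h2.a and find_next('a'): if an h2 has no link inside it, h2.a is None (so h2.a["href"] raises TypeError), and find_next('a') keeps searching forward in the document and may return an unrelated link. A defensive variant, sketched against an assumed inline document in which the second heading is a made-up case without a link:

```python
from bs4 import BeautifulSoup

# The second h2 is a hypothetical heading without any link.
html = """
<h2 class="m0 t-regular">
  <a href="/en/qatar/jobs/executive-chef-4276199/" data-job-id="4276199">Executive Chef </a>
</h2>
<h2 class="m0 t-regular">Heading without a link</h2>
"""
soup = BeautifulSoup(html, "html.parser")

follow_links = []
for h2 in soup.find_all("h2", class_="m0 t-regular"):
    # Search only inside this h2; href=True skips anchors without an href.
    a = h2.find("a", href=True)
    if a is not None and "#" not in a["href"]:
        follow_links.append(a["href"])
print(follow_links)
```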

Getting text between HTML <a> tags with Beautiful Soup

I'm trying to scrape all movie names from a website.
url = 'https://www.boxofficemojo.com/year/world/2019/'
content = session.get(url, verify=False).content
soup = BeautifulSoup(content, "html.parser")
movie = soup.find('a', {'class': 'a-link-normal'})
print(movie)
With this code I get the following result
<a class="a-link-normal" href="/?ref_=bo_nb_ydw_mojologo"></a>
However when I inspect the page I get the result below. The text between the 'a' tag is what I need.
<a class="a-link-normal" href="/releasegroup/gr3511898629/?ref_=bo_ydw_table_1">Avengers: Endgame</a>
How do I retrieve it?

movie = soup.find('td', class_='a-text-left mojo-field-type-release_group')
print(movie.text)
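Since the question asks for all movie names, the same cell class can be used with select() to collect every title. This is a sketch only: the table markup is an assumption built around the a tag shown in the question, and the second row is invented for illustration.

```python
from bs4 import BeautifulSoup

# Assumed table structure; the second row is a hypothetical example.
html = """
<table>
  <tr><td class="a-text-left mojo-field-type-release_group">
    <a class="a-link-normal" href="/releasegroup/gr3511898629/?ref_=bo_ydw_table_1">Avengers: Endgame</a>
  </td></tr>
  <tr><td class="a-text-left mojo-field-type-release_group">
    <a class="a-link-normal" href="/releasegroup/gr0000000000/">Example Movie</a>
  </td></tr>
</table>
"""
soup = BeautifulSoup(html, "html.parser")

# Restricting the search to the release-group cells skips the empty logo link
# that soup.find('a', {'class': 'a-link-normal'}) returned in the question.
titles = [
    a.get_text(strip=True)
    for a in soup.select("td.mojo-field-type-release_group a")
]
print(titles)
```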

Beautifulsoup: parsing html – get part of href

I'm trying to parse
<td height="16" class="listtable_1">76561198134729239</td>
to get the 76561198134729239, and I can't figure out how to do it. What I tried:
import requests
from lxml import html
from bs4 import BeautifulSoup
r = requests.get("http://ppm.rep.tf/index.php?p=banlist&page=154")
content = r.content
soup = BeautifulSoup(content, "html.parser")
element = soup.find("td",
    {
        "class":"listtable_1",
        "target":"_blank"
    })
print(element.text)

There are many such entries in that HTML. To get all of them you could use the following:
import requests
from lxml import html
from bs4 import BeautifulSoup
r = requests.get("http://ppm.rep.tf/index.php?p=banlist&page=154")
soup = BeautifulSoup(r.content, "html.parser")
for td in soup.findAll("td", class_="listtable_1"):
    for a in td.findAll("a", href=True, target="_blank"):
        print(a.text)
This would then return:
76561198143466239
76561198094114508
76561198053422590
76561198066478249
76561198107353289
76561198043513442
76561198128253254
76561198134729239
76561198003749039
76561198091968935
76561198071376804
76561198068375438
76561198039625269
76561198135115106
76561198096243060
76561198067255227
76561198036439360
76561198026089333
76561198126749681
76561198008927797
76561198091421170
76561198122328638
76561198104586244
76561198056032796
76561198059683068
76561197995961306
76561198102013044

"target":"_blank" is an attribute of the anchor tag a inside the td tag. It's not an attribute of the td tag.
You can get it like so:
from bs4 import BeautifulSoup
html="""
<td height="16" class="listtable_1">
<a href="http://steamcommunity.com/profiles/76561198134729239" target="_blank">
76561198134729239
</a>
</td>"""
soup = BeautifulSoup(html, 'html.parser')
print(soup.find('td', {'class': "listtable_1"}).find('a', {"target":"_blank"}).text)
Output:
76561198134729239

As others mentioned, you are trying to check attributes of different elements in a single find(). Instead, you can chain find() calls as MYGz suggested, or use a single CSS selector:
soup.select_one("td.listtable_1 a[target=_blank]").get_text()
If you need to locate multiple elements this way, use select():
for elm in soup.select("td.listtable_1 a[target=_blank]"):
    print(elm.get_text())

"class":"listtable_1" belongs to the td tag and target="_blank" belongs to the a tag, so you should not use them together in one find().
You can use the "Steam Community" cell as an anchor to find the number after it.
Or use the URL: it contains the info you need and is easy to find, so you can match the URL and split it by /:
import re
for a in soup.find_all('a', href=re.compile(r'steamcommunity')):
    num = a['href'].split('/')[-1]
    print(num)
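The URL-splitting step can be tried on its own. In this sketch the first URL comes from the markup quoted in another answer. The second is an assumed variant with a trailing slash, built from one of the IDs in the output, to show why stripping the slash first is safer.

```python
# The second href (with a trailing slash) is a hypothetical variant.
hrefs = [
    "http://steamcommunity.com/profiles/76561198134729239",
    "http://steamcommunity.com/profiles/76561198143466239/",
]

# Without rstrip, a trailing slash would make split('/')[-1] an empty string.
ids = [href.rstrip("/").split("/")[-1] for href in hrefs]
print(ids)
```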
Code for the "Steam Community" approach:
import requests
from lxml import html
from bs4 import BeautifulSoup
r = requests.get("http://ppm.rep.tf/index.php?p=banlist&page=154")
content = r.content
soup = BeautifulSoup(content, "html.parser")
for td in soup.find_all('td', string="Steam Community"):
    num = td.find_next_sibling('td').text
    print(num)
out:
76561198143466239
76561198094114508
76561198053422590
76561198066478249
76561198107353289
76561198043513442
76561198128253254
76561198134729239
76561198003749039
76561198091968935
76561198071376804
76561198068375438
76561198039625269
76561198135115106
76561198096243060
76561198067255227
76561198036439360
76561198026089333
76561198126749681
76561198008927797
76561198091421170
76561198122328638
76561198104586244
76561198056032796
76561198059683068
76561197995961306
76561198102013044

You could chain together two finds in gazpacho to solve this problem:
from gazpacho import Soup
html = """<td height="16" class="listtable_1"><a href="http://steamcommunity.com/profiles/76561198134729239" target="_blank">76561198134729239</a></td>"""
soup = Soup(html)
soup.find("td", {"class": "listtable_1"}).find("a", {"target": "_blank"}).text
This outputs:
'76561198134729239'
