BeautifulSoup - which URL - Python

I started a little project. I am trying to scrape the URL http://pr0gramm.com/
and save the tags under a picture in a variable, but I am having trouble doing so.
I am looking for this element in the page source:
<a class="tag-link" href="/top/Flaschenkind">Flaschenkind</a>
I actually just need the part "Flaschenkind" to be saved, along with the tags that follow on that line.
This is my code so far:
import requests
from bs4 import BeautifulSoup
url = "http://pr0gramm.com/"
r = requests.get(url)
soup = BeautifulSoup(r.content, "lxml")
links = soup.find_all("div", {"class" : "item-tags"})
print(links)
Sadly, I just get this output:
[]
I already tried changing the URL to http://pr0gramm.com/top/,
but I get the same output. I wonder if this happens because the site is rendered with JavaScript, so the data can't be scraped this way?

The problem is that this is a dynamic site: all of the data you see is loaded via additional XHR calls to the website's JSON API. You need to simulate those calls in your code.
Working example using requests:
from urllib.parse import urljoin
import requests
base_image_url = "http://img.pr0gramm.com"
with requests.Session() as session:
    response = session.get("http://pr0gramm.com/api/items/get", params={"flags": 1, "promoted": "1"})
    posts = response.json()["items"]
    for post in posts:
        image_url = urljoin(base_image_url, post["image"])

        # get tags
        response = session.get("http://pr0gramm.com/api/items/info", params={"itemId": post["id"]})
        post_data = response.json()
        tags = [tag["tag"] for tag in post_data["tags"]]

        print(image_url, tags)
This would print the post image url as well as a list of post tags:
http://img.pr0gramm.com/2016/03/07/f693234d558334d7.jpg ['Datsun 1600 Wagon', 'Garage 88', 'Kombi', 'nur Oma liegt tiefer', 'rolladen', 'slow']
http://img.pr0gramm.com/2016/03/07/185544cda956679e.webm ['Danke Merkel', 'deeskalierte zeitnah', 'demokratie im endstadium', 'Fachkraft', 'Far Cry Primal', 'Invite is raus', 'typ ist nackt', 'VVS', 'webm', 'zeigt seine stange']
http://img.pr0gramm.com/2016/03/07/4a6719b33219fd87.jpg ['bmw', 'der Gerät', 'Drehmoment', 'für mehr Motorräder auf pr0', 'Motorrad']
...

First off, your URL is the JavaScript-enabled version of the site. They offer a static version at www.pr0gramm.com/static/, where the content is formatted more like your example suggests you expect.
Using this static URL, I retrieved the <a> tags with code much like yours (below); I removed the class filter. Python 2.7:
import bs4
import urllib2
def main():
    url = "http://pr0gramm.com/static/"
    try:
        fin = urllib2.urlopen(url)
    except:
        print "Url retrieval failed url:", url
        return None
    html = fin.read()
    bs = bs4.BeautifulSoup(html, "html5lib")
    links = bs.find_all("a")
    print links
    return None

if __name__ == "__main__":
    main()
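For reference, a Python 3 sketch of the same approach (urllib2 was split into urllib.request and urllib.error in Python 3):
import bs4
import urllib.request
import urllib.error

def main():
    url = "http://pr0gramm.com/static/"
    try:
        fin = urllib.request.urlopen(url)
    except urllib.error.URLError:
        print("Url retrieval failed url:", url)
        return None
    html = fin.read()
    bs = bs4.BeautifulSoup(html, "html5lib")
    links = bs.find_all("a")
    print(links)

if __name__ == "__main__":
    main()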

Related

How to webscrape old school website that uses frames

I am trying to web-scrape a government site that uses a frameset.
Here is the URL - https://lakecounty.in.gov/departments/voters/election-results-c/2022GeneralElectionResults/index.htm
I've tried using Splinter/Selenium:
url = "https://lakecounty.in.gov/departments/voters/election-results-c/2022GeneralElectionResults/index.htm"
browser.visit(url)
time.sleep(10)
full_xpath_frame = '/html/frameset/frameset/frame[2]'
tree = browser.find_by_xpath(full_xpath_frame)
for i in tree:
    print(i.text)
It just returns an empty string.
I've tried using the requests library.
import requests
from lxml import html
url = "https://lakecounty.in.gov/departments/voters/election-results-c/2022GeneralElectionResults/index.htm"
# get response object
response = requests.get(url)
# get byte string
data = response.content
print(data)
And it returns this
b"<html>\r\n<head>\r\n<meta http-equiv='Content-Type'\r\ncontent='text/html; charset=iso-
8859-1'>\r\n<title>Lake_ County Election Results</title>\r\n</head>\r\n<FRAMESET rows='20%,
*'>\r\n<FRAME src='titlebar.htm' scrolling='no'>\r\n<FRAMESET cols='20%, *'>\r\n<FRAME
src='menu.htm'>\r\n<FRAME src='Lake_ElecSumm_all.htm' name='reports'>\r\n</FRAMESET>
\r\n</FRAMESET>\r\n<body>\r\n</body>\r\n</html>\r\n"
I've also tried using Beautiful Soup, and it gave me the same thing. Is there another Python library I can use to get the data that's inside the second table?
Thank you for any feedback.
As mentioned, you could go for the frames and their src:
BeautifulSoup(r.text, 'html.parser').select('frame')[1].get('src')
or directly to the menu.htm:
import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin

base = 'https://lakecounty.in.gov/departments/voters/election-results-c/2022GeneralElectionResults/'
r = requests.get(urljoin(base, 'menu.htm'))
# urljoin resolves the relative hrefs against the base URL
link_list = [urljoin(base, a.get('href')) for a in BeautifulSoup(r.text, 'html.parser').select('a')]

for link in link_list[:1]:
    r = requests.get(link)
    soup = BeautifulSoup(r.text, 'html.parser')
    # ...scrape what is needed
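From there, each report page is essentially a plain HTML table, so one quick option is pandas.read_html. This is a sketch; that pandas is available and that the first table on the page is the one you want are assumptions:
import pandas as pd
import requests

r = requests.get(link_list[0])   # first report page from link_list above
tables = pd.read_html(r.text)    # parses every <table> on the page into a DataFrame
print(tables[0].head())          # which table index you need is an assumption - inspect and adjust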

How to remove unwanted text when retrieving the title of a page using Python

Hi all, I have written a Python program to retrieve the title of a page. It works fine, but with some pages it also receives some unwanted text. How can I avoid that?
Here is my program:
# importing the modules
import requests
from bs4 import BeautifulSoup

# target url
url = 'https://atlasobscura.com'

# making requests instance
reqs = requests.get(url)

# using the BeautifulSoup module
soup = BeautifulSoup(reqs.text, 'html.parser')

# displaying the title
print("Title of the website is : ")
for title in soup.find_all('title'):
    title_data = title.get_text().lower().strip()
    print(title_data)
Here is my output:
atlas obscura - curious and wondrous travel destinations
aoc-full-screen
aoc-heart-solid
aoc-compass
aoc-flipboard
aoc-globe
aoc-pocket
aoc-share
aoc-cancel
aoc-video
aoc-building
aoc-clock
aoc-clipboard
aoc-help
aoc-arrow-right
aoc-arrow-left
aoc-ticket
aoc-place-entry
aoc-facebook
aoc-instagram
aoc-reddit
aoc-rss
aoc-twitter
aoc-accommodation
aoc-activity-level
aoc-add-a-photo
aoc-add-box
aoc-add-shape
aoc-arrow-forward
aoc-been-here
aoc-chat-bubbles
aoc-close
aoc-expand-more
aoc-expand-less
aoc-forum-flag
aoc-group-size
aoc-heart-outline
aoc-heart-solid
aoc-home
aoc-important
aoc-knife-fork
aoc-library-books
aoc-link
aoc-list-circle-bullets
aoc-list
aoc-location-add
aoc-location
aoc-mail
aoc-map
aoc-menu
aoc-more-horizontal
aoc-my-location
aoc-near-me
aoc-notifications-alert
aoc-notifications-mentions
aoc-notifications-muted
aoc-notifications-tracking
aoc-open-in-new
aoc-pencil
aoc-person
aoc-pinned
aoc-plane-takeoff
aoc-plane
aoc-print
aoc-reply
aoc-search
aoc-shuffle
aoc-star
aoc-subject
aoc-trip-style
aoc-unpinned
aoc-send
aoc-phone
aoc-apps
aoc-lock
aoc-verified
Instead of this, I expect to receive only this line:
"atlas obscura - curious and wondrous travel destinations"
Please help me with some idea; most other websites work fine, only some websites give this problem.
Your problem is that you're finding all the occurrences of "title" in the page; the extra aoc-* strings most likely come from <title> elements inside the page's inline SVG icons. Beautiful Soup has an attribute title specifically for what you're trying to do. Here's your modified code:
# importing the modules
import requests
from bs4 import BeautifulSoup
# target url
url = 'https://atlasobscura.com'
# making requests instance
reqs = requests.get(url)
# using the BeautifulSoup module
soup = BeautifulSoup(reqs.text, 'html.parser')
title_data = soup.title.text.lower()
# displaying the title
print("Title of the website is : ")
print(title_data)
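If you want to be explicit that only the document title is meant, you can also scope the search to the <head>; a small sketch of that alternative:
# <title> tags inside inline SVG icons live in the <body>, so scoping to <head> skips them
title_data = soup.select_one('head > title').get_text().lower().strip()
print(title_data)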

Beautifulsoup "findAll()" does not return the tags

I am trying to build a scraper to get some abstracts of academic papers and their corresponding titles on this page.
The problem is that my for link in bsObj.findAll('a',{'class':'search-track'}) does not return the links I need to further build my scraper. In my code, the check is like this:
for link in bsObj.findAll('a', {'class': 'search-track'}):
    print(link)
The for loop above does not print out anything, even though the href links should be inside the <a class="search-track" ...</a>.
I have referred to this post, but changing the BeautifulSoup parser does not solve the problem in my code. I am using "html.parser" in my BeautifulSoup constructor: bsObj = bs(html.content, features="html.parser").
And the print(len(bsObj)) prints out "3" while it prints out "2" for both "lxml" and "html5lib".
Also, I started off using urllib.request.urlopen to get the page and then tried requests.get() instead. Unfortunately the two approaches give me the same bsObj.
Here is the code I've written:
#from urllib.request import urlopen
import requests
from bs4 import BeautifulSoup as bs
import ssl

'''
The elsevier search is kind of a tree structure:
"keyword --> a list of journals (a journal contains many articles) --> lists of articles"
'''

address = input("Please type in your keyword: ")  # My keyword is "catalyst for water splitting"
# https://www.elsevier.com/en-xs/search-results?
# query=catalyst%20for%20water%20splitting&labels=journals&page=1
address = address.replace(" ", "%20")
address = "https://www.elsevier.com/en-xs/search-results?query=" + address + "&labels=journals&page=1"

journals = []
articles = []

def getJournals(url):
    global journals
    #html = urlopen(url)
    html = requests.get(url)
    bsObj = bs(html.content, features="html.parser")
    #print(len(bsObj))
    #testFile = open('testFile.txt', 'wb')
    #testFile.write(bsObj.text.encode(encoding='utf-8', errors='strict') + '\n'.encode(encoding='utf-8', errors='strict'))
    #testFile.close()
    for link in bsObj.findAll('a', {'class': 'search-track'}):
        print(link)
        ######## does not print anything ########
        '''
        if 'href' in link.attrs and link.attrs['href'] not in journals:
            newJournal = link.attrs['href']
            journals.append(newJournal)
        '''
    return None

# Ignore SSL certificate errors
ctx = ssl.create_default_context()
ctx.check_hostname = False
ctx.verify_mode = ssl.CERT_NONE

getJournals(address)
print(journals)
Can anyone tell me what the problem in my code is that makes the for loop not print any links? I need to store the journal links in a list and then visit each link to scrape the abstracts of the papers. The abstracts are freely accessible, so the website shouldn't have blocked my ID for that.
This page is dynamically loaded with JavaScript, so BeautifulSoup can't handle it directly. You may be able to do it using Selenium, but in this case you can do it by tracking the API calls made by the page (for more, see, as one of many examples, here).
In your particular case it can be done this way:
from bs4 import BeautifulSoup as bs
import requests
import json
#this is where the data is hiding:
url = "https://site-search-api.prod.ecommerce.elsevier.com/search?query=catalyst%20for%20water%20splitting&labels=journals&start=0&limit=10&lang=en-xs"
html = requests.get(url)
soup = bs(html.content, features="html.parser")
data = json.loads(str(soup))#response is in json format so we load it into a dictionary
Note: in this case, it's also possible to dispense with Beautifulsoup altogether and load the response directly, as in data = json.loads(html.content). From this point:
hits = data['hits']['hits']  # target urls are hidden deep inside nested dictionaries and lists
for hit in hits:
    print(hit['_source']['url'])
Output:
https://www.journals.elsevier.com/water-research
https://www.journals.elsevier.com/water-research-x
etc.
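For completeness, a minimal version of the BeautifulSoup-free variant mentioned above, using the same endpoint:
import requests

url = ("https://site-search-api.prod.ecommerce.elsevier.com/search"
       "?query=catalyst%20for%20water%20splitting&labels=journals&start=0&limit=10&lang=en-xs")
data = requests.get(url).json()  # the endpoint returns JSON directly
for hit in data['hits']['hits']:
    print(hit['_source']['url'])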

Can't fetch links connected to different exhibitors from a webpage

I've been trying to fetch the links to the different exhibitors from this webpage using a Python script, but I get nothing as a result, and no error either. The class name m-exhibitors-list__items__item__name__link I've used in my script is present in the page source, so the links are not generated dynamically.
What change should I make in my script to get the links?
This is what I've tried with:
from bs4 import BeautifulSoup
import requests
link = 'https://www.topdrawer.co.uk/exhibitors?page=1'
with requests.Session() as s:
    s.headers['User-Agent'] = 'Mozilla/5.0'
    response = s.get(link)
    soup = BeautifulSoup(response.text, "lxml")
    for item in soup.select("a.m-exhibitors-list__items__item__name__link"):
        print(item.get("href"))
One such link I'm after (the first one):
https://www.topdrawer.co.uk/exhibitors/alessi-1
@Life is complex is right: the site you are trying to scrape is protected by the Incapsula service, which guards sites against web scraping and other attacks. It checks the request headers to decide whether a request comes from a browser or from a robot. Most likely the site has proprietary data, or they may be protecting against other threats.
However, there is an option to achieve what you want using Selenium and BS4.
The following is a code snippet for your reference:
from bs4 import BeautifulSoup
from selenium import webdriver

link = 'https://www.topdrawer.co.uk/exhibitors?page=1'

CHROMEDRIVER_PATH = r"C:\Users\XYZ\Downloads\Chromedriver.exe"  # raw string so the backslashes are not treated as escapes
wd = webdriver.Chrome(CHROMEDRIVER_PATH)
wd.get(link)
html_page = wd.page_source
soup = BeautifulSoup(html_page, "lxml")
results = soup.find_all("a", {"class": "m-exhibitors-list__items__item__name__link"})

# iterate over the list of anchor tags to get the href attribute
for item in results:
    print(item.get("href"))

wd.quit()
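If you don't want a browser window to open, Chrome can also run headless; a sketch, assuming a Selenium version that accepts the options keyword:
from selenium import webdriver
from selenium.webdriver.chrome.options import Options

opts = Options()
opts.add_argument('--headless')  # render the page without opening a window
wd = webdriver.Chrome(CHROMEDRIVER_PATH, options=opts)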
The site that you are attempting to scrape is protected with Incapsula.
import requests
from bs4 import BeautifulSoup
from pprint import pprint

# http_headers was not defined in the original snippet; a browser-like set is assumed here
http_headers = {'User-Agent': 'Mozilla/5.0'}

target_url = 'https://www.topdrawer.co.uk/exhibitors?page=1'
response = requests.get(target_url,
                        headers=http_headers, allow_redirects=True, verify=True, timeout=30)
raw_html = response.text
soupParser = BeautifulSoup(raw_html, 'lxml')
pprint(soupParser.text)
OUTPUT:
('Request unsuccessful. Incapsula incident ID: '
 '438002260604590346-1456586369751453219')
Read through this: https://www.quora.com/How-can-I-scrape-content-with-Python-from-a-website-protected-by-Incapsula
and these: https://stackoverflow.com/search?q=Incapsula
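A quick programmatic check for whether a response is the Incapsula block page rather than real content (the marker string is taken from the output above):
import requests

r = requests.get('https://www.topdrawer.co.uk/exhibitors?page=1',
                 headers={'User-Agent': 'Mozilla/5.0'}, timeout=30)
if 'Incapsula incident ID' in r.text:
    print('Blocked by Incapsula - plain requests will not get the real page')
else:
    print('Got real content')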

Using Python to pull href tags

Trying to pull the href links for the products on this webpage. The code pulls all of the hrefs except those of the products listed on the page.
from bs4 import BeautifulSoup
import requests
url = "https://www.neb.com/search#t=_483FEC15-900D-4CF1-B514-1B921DD055BA&sort=%40ftitle51880%20ascending"
response = requests.get(url)
data = response.text
soup = BeautifulSoup(data, 'lxml')
tags = soup.find_all('a')
for tag in tags:
    print(tag.get('href'))
The products are loaded dynamically through a REST API; the URL is this:
https://international.neb.com/coveo/rest/v2/?sitecoreItemUri=sitecore%3A%2F%2Fweb%2F%7BA1D9D237-B272-4C5E-A23F-EC954EB71A26%7D%3Flang%3Den%26ver%3D1&siteName=nebinternational
Loading this response will get you the URLs.
Next time, check in your network inspector whether any part of the web page is loaded dynamically (or use Selenium).
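A minimal sketch of loading that endpoint; the response structure and any required headers are assumptions, so inspect the top-level keys first and adapt:
import requests

api_url = ("https://international.neb.com/coveo/rest/v2/"
           "?sitecoreItemUri=sitecore%3A%2F%2Fweb%2F%7BA1D9D237-B272-4C5E-A23F-EC954EB71A26%7D"
           "%3Flang%3Den%26ver%3D1&siteName=nebinternational")
data = requests.get(api_url).json()
print(list(data.keys()))  # find where the product URLs live before digging deeper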
Try to verify whether the product hrefs are in the received response. I'm telling you to do this because if the products section is generated dynamically by AJAX, for example, a simple GET on the main page will not bring them.
Print the response and verify that the products are present in the HTML.
I think you want something like this:
from bs4 import BeautifulSoup
import urllib.request
for numb in ('1', '100'):
    resp = urllib.request.urlopen("https://www.neb.com/search#first=" + numb + "&t=_483FEC15-900D-4CF1-B514-1B921DD055BA&sort=%40ftitle51880%20ascending")
    soup = BeautifulSoup(resp, "html.parser", from_encoding=resp.info().get_param('charset'))
    for link in soup.find_all('a', href=True):
        print(link['href'])
