I have tried parsing the following Pinterest page using urllib, requests, and chromedriver:
https://www.pinterest.com/pin/463237511669606028/
But it looks like some sections of the page are missing from my result. Specifically, I'm trying to parse the number of re-pins (shown below the comments), but I can't get to it.
I have tried both of the options below, but the userActivity class is not part of what I get back:
driver.get("https://www.pinterest.com/pin/463237511669606028/")
html = driver.page_source
soup = BeautifulSoup(html, features="html.parser")
and
import urllib2
from bs4 import BeautifulSoup
req = urllib2.Request("https://www.pinterest.com/pin/463237511669606028/",
                      headers={'User-Agent': "PyBrowser"})
con = urllib2.urlopen(req)
content = con.read()
soup = BeautifulSoup(content, features="html.parser")
Any ideas?
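For what it's worth, the re-pin count on Pinterest is rendered by JavaScript after the initial HTML arrives, so it never appears in the raw source that urllib or requests fetches, and driver.page_source can be read before the script has finished running. A minimal sketch using Selenium's explicit waits (userActivity is the class name from the question; this assumes it still exists on the rendered page):

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from bs4 import BeautifulSoup

driver = webdriver.Chrome()
driver.get("https://www.pinterest.com/pin/463237511669606028/")

# Wait up to 10 seconds for the JavaScript-rendered block to appear
WebDriverWait(driver, 10).until(
    EC.presence_of_element_located((By.CLASS_NAME, "userActivity"))
)

soup = BeautifulSoup(driver.page_source, "html.parser")
print(soup.find(class_="userActivity"))
driver.quit()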
I've been trying to fetch the links to the different exhibitors from this web page using a Python script, but I get nothing as a result and no error either. The class name m-exhibitors-list__items__item__name__link that I've used in my script is present in the page source, so the links are not generated dynamically.
What change should I make to my script to get the links?
This is what I've tried:
from bs4 import BeautifulSoup
import requests
link = 'https://www.topdrawer.co.uk/exhibitors?page=1'
with requests.Session() as s:
    s.headers['User-Agent'] = 'Mozilla/5.0'
    response = s.get(link)
    soup = BeautifulSoup(response.text, "lxml")
    for item in soup.select("a.m-exhibitors-list__items__item__name__link"):
        print(item.get("href"))
One such link I'm after (the first one):
https://www.topdrawer.co.uk/exhibitors/alessi-1
@Life is complex is right: the site you are trying to scrape is protected by the Incapsula service, which guards sites against web scraping and other attacks by inspecting the request headers to decide whether a request comes from a real browser or from a bot. Most likely the site has proprietary data, or it may be protecting itself against other threats.
However, there is a way to achieve what you want using Selenium and BS4.
The following code snippet is for your reference:
from bs4 import BeautifulSoup
from selenium import webdriver

link = 'https://www.topdrawer.co.uk/exhibitors?page=1'

# Use a raw string so the backslashes in the Windows path are not treated as escape sequences
CHROMEDRIVER_PATH = r"C:\Users\XYZ\Downloads\chromedriver.exe"

wd = webdriver.Chrome(CHROMEDRIVER_PATH)
wd.get(link)
html_page = wd.page_source
soup = BeautifulSoup(html_page, "lxml")
results = soup.find_all("a", {"class": "m-exhibitors-list__items__item__name__link"})

# Iterate over the anchor tags to get each href attribute
for item in results:
    print(item.get("href"))
wd.quit()
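A side note: Selenium 4 removed the positional executable-path argument, so on a current Selenium the driver path goes through a Service object instead:

from selenium import webdriver
from selenium.webdriver.chrome.service import Service

# Selenium 4 style: wrap the chromedriver path in a Service
wd = webdriver.Chrome(service=Service(CHROMEDRIVER_PATH))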
The site that you are attempting to scrape is protected with Incapsula.
import requests
from bs4 import BeautifulSoup
from pprint import pprint

# http_headers was not defined in the original snippet; a minimal stand-in
http_headers = {'User-Agent': 'Mozilla/5.0'}

target_url = 'https://www.topdrawer.co.uk/exhibitors?page=1'
response = requests.get(target_url,
                        headers=http_headers, allow_redirects=True, verify=True, timeout=30)
raw_html = response.text
soupParser = BeautifulSoup(raw_html, 'lxml')
pprint(soupParser.text)

**OUTPUTS**

('Request unsuccessful. Incapsula incident ID: '
 '438002260604590346-1456586369751453219')
Read through this: https://www.quora.com/How-can-I-scrape-content-with-Python-from-a-website-protected-by-Incapsula
and these: https://stackoverflow.com/search?q=Incapsula
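If you still want to try plain requests first, sending a fuller set of browser-like headers sometimes gets past simpler checks. A sketch with illustrative header values (Incapsula may well still block it, in which case the Selenium route shown above is the more reliable one):

import requests

# Illustrative browser-like headers; the values are examples, not magic
http_headers = {
    'User-Agent': ('Mozilla/5.0 (Windows NT 10.0; Win64; x64) '
                   'AppleWebKit/537.36 (KHTML, like Gecko) '
                   'Chrome/74.0.3729.169 Safari/537.36'),
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
    'Accept-Language': 'en-GB,en;q=0.9',
}

response = requests.get('https://www.topdrawer.co.uk/exhibitors?page=1',
                        headers=http_headers, timeout=30)
print(response.status_code)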
I am using bs4 in Python to parse web pages and get information, and I am having trouble grabbing just the title. Another part I struggled with was following the links: should this be done recursively, or could I do it with a loop?
def getTitle(link):
    resp = urllib.request.urlopen(link)
    soup = BeautifulSoup(resp, 'html.parser')
    print(soup.find("<title>"))
find() takes the tag name without the angle brackets, so soup.find("<title>") matches nothing; the title element is also available directly as soup.title:
from bs4 import BeautifulSoup
import urllib.request

def getTitle(link):
    resp = urllib.request.urlopen(link)
    soup = BeautifulSoup(resp, 'html.parser')
    return soup.title.text

print(getTitle('http://www.bbc.co.uk/news'))
Which displays:
Home - BBC News
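As for following the links, a loop over a queue is usually easier to control than recursion, because you can cap the number of pages visited and skip duplicates. A minimal breadth-first sketch (the crawl_titles name and the limit parameter are just for illustration):

from collections import deque
from urllib.parse import urljoin
import urllib.request
from bs4 import BeautifulSoup

def crawl_titles(start, limit=10):
    # Breadth-first: a queue of links to visit, a set of links already seen
    queue = deque([start])
    seen = set()
    while queue and len(seen) < limit:
        link = queue.popleft()
        if link in seen:
            continue
        seen.add(link)
        resp = urllib.request.urlopen(link)
        soup = BeautifulSoup(resp, 'html.parser')
        print(soup.title.text if soup.title else link)
        # Queue every absolute http(s) link found on this page
        for a in soup.find_all('a', href=True):
            next_link = urljoin(link, a['href'])
            if next_link.startswith('http'):
                queue.append(next_link)

crawl_titles('http://www.bbc.co.uk/news')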
I need to get the text 2,585 shown in the screenshot below. I am very new to coding, but this is what I have so far:
import urllib2
from bs4 import BeautifulSoup
url= 'insertURL'
r = requests.get(url)
data = r.text
soup = BeautifulSoup(data, 'html.parser')
span = soup.find('span', id='d21475972e793-wk-Fact -8D34B98C76EF518C788A2177E5B18DB0')
print (span.text)
Any info is helpful!! Thanks.
Website HTML
Three things: you're using requests, not urllib2; you're selecting XML with namespaces, so you need to use xml as the parser; and the element you want is not a span, it is ix:nonFraction. Here is a working example using another web page (you just need to point it at your page and use the commented line).
# Using requests no need for urllib2.
import requests
from bs4 import BeautifulSoup
# Using this page as an example.
url= 'https://www.sec.gov/Archives/edgar/data/27904/000002790417000004/0000027904-17-000004.txt'
r = requests.get(url)
data = r.text
# use xml as the parser.
soup = BeautifulSoup(data, 'xml')
ix = soup.find('ix:nonFraction', id="Fact-7365D69E1478B0A952B8159A2E39B9D8-wk-Fact-7365D69E1478B0A952B8159A2E39B9D8")
# Your original code for your page.
# ix = soup.find('ix:nonFraction', id='d21475972e793-wk-Fact-8D34B98C76EF518C788A2177E5B18DB0')
print(ix.text)
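If the generated id is hard to pin down, it can be easier to list every ix:nonFraction fact in the same soup and pick out the one you need; a small follow-up sketch:

# List each inline-XBRL numeric fact with its name attribute and value
for fact in soup.find_all('ix:nonFraction'):
    print(fact.get('name'), fact.text)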
html = urlopen(url)
bs = BeautifulSoup(html.read(), 'html5lib')
After running it several times, the process gets stuck at BeautifulSoup(html.read(), 'html5lib'), and I have tried changing the parser from 'html5lib' to 'lxml' and 'html.parser', but the problem persists. Is there a bug in BeautifulSoup? How can I solve this problem?
update
I add some logs inside the program, like this
print('open the url')
html = urlopen(url)
print('create BeautifulSoup Object')
bs = BeautifulSoup(html.read(), 'html5lib')
The console prints create BeautifulSoup Object and just stays there with a blinking cursor.
I've encountered the same problem and found that the program got stuck at html.read(), which may be because the urlopen() resource does not close correctly when the response has errors.
You can change it like this:
with urlopen(url) as html:
    html = html.read()
bs = BeautifulSoup(html, "lxml")
Or you can use the requests package, which is nicer to work with than urllib:
import requests
html = requests.get(url).text
bs = BeautifulSoup(html, "lxml")
Hope this solves your problem.
I am doing Python Challenge level 9 with BeautifulSoup.
url = "http://www.pythonchallenge.com/pc/return/good.html"
bs4.__version__ == '4.3.2'
There are two comments in its page source, so the output of soup should include both. However, when BeautifulSoup is applied, the second comment is missing.
It seems kinda weird. Any hint? Thanks!
import requests
from bs4 import BeautifulSoup
url = "http://www.pythonchallenge.com/pc/return/good.html"
page = requests.get(url, auth = ("huge", "file")).text
print page
soup = BeautifulSoup(page)
print soup
Beautiful Soup is a wrapper around an HTML parser. The default parser is very strict, and when it encounters malformed HTML it silently drops the elements it had trouble with.
You should instead install the package 'html5lib' and use that as your parser, like so:
soup = BeautifulSoup(page, 'html5lib')
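The difference is easy to see on a tiny piece of invalid HTML; this comparison is adapted from the "Differences between parsers" section of the Beautiful Soup documentation:

from bs4 import BeautifulSoup

# Each parser repairs the invalid markup "<a></p>" differently:
# html.parser simply drops the stray </p>, lxml does the same but wraps
# the result in <html><body>, and html5lib inserts an empty <p> inside
# the <a> to produce the tree a browser would build.
for parser in ('html.parser', 'lxml', 'html5lib'):
    print(parser, '->', BeautifulSoup('<a></p>', parser))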