I am running Beautiful Soup 4.5 with Python 3.4 on Windows 7. Here is my script:
from bs4 import BeautifulSoup
import urllib3
http = urllib3.PoolManager()
url = 'https://scholar.google.com'
response = http.request('GET', url)
html2 = response.read()
soup = BeautifulSoup([html2])
print (type(soup))
Here is the error I am getting:
TypeError: Expected String or Buffer
I have researched this and there seem to be no fixes except going to an older version of Beautiful Soup, which I don't want to do. Any help would be much appreciated.
Not sure why you are putting the HTML string into a list here:
soup = BeautifulSoup([html2])
Replace it with:
soup = BeautifulSoup(html2)
Alternatively, you can pass the response as a file-like object and BeautifulSoup will read it for you:
response = http.request('GET', url)
soup = BeautifulSoup(response)
It is also a good idea to specify a parser explicitly:
soup = BeautifulSoup(html2, "html.parser")
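For completeness, here is a minimal corrected version of the original script. One assumption worth flagging: urllib3's PoolManager preloads the response body by default, so the bytes are available on the .data attribute rather than through .read():
from bs4 import BeautifulSoup
import urllib3

http = urllib3.PoolManager()
url = 'https://scholar.google.com'
response = http.request('GET', url)
# The body is preloaded into .data as bytes by default
html2 = response.data
soup = BeautifulSoup(html2, "html.parser")
print(type(soup))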
Related
I was following a Python guide on web scraping and there's one line of code that won't work for me. I'd appreciate it if anybody could help me figure out what the issue is, thanks.
from bs4 import BeautifulSoup
import json
import re
import requests
url = 'https://finance.yahoo.com/quote/AAPL/analysis?p=AAPL'
page = requests.get(url)
soup = BeautifulSoup(page.content, "lxml")
script = soup.find('script', text=re.compile('root\.App\.main'))
json_text = re.search(r'^\s*root\.App\.main\s*=\s*({.*?})\s*;\s*$',script.string, flags=re.MULTILINE).group(1)
Error Message:
json_text = re.search(r'^\s*root\.App\.main\s*=\s*({.*?})\s*;\s*$',script.string, flags=re.MULTILINE).group(1)
AttributeError: 'NoneType' object has no attribute 'string'
Link to the guide I was looking at: https://www.mattbutton.com/how-to-scrape-stock-upgrades-and-downgrades-from-yahoo-finance/
The main issue, in my opinion, is that you should add a user-agent to your request so that you get the expected HTML:
headers = {'user-agent':'Mozilla/5.0'}
page = requests.get(url, headers=headers)
Note: First of all, take a deeper look into your soup to check whether the expected information is available.
Example
import re
import json
import requests
from bs4 import BeautifulSoup

url = 'https://finance.yahoo.com/quote/AAPL/analysis?p=AAPL'
headers = {'user-agent': 'Mozilla/5.0'}
page = requests.get(url, headers=headers)
soup = BeautifulSoup(page.content, "html.parser")
script = soup.find('script', text=re.compile(r'root\.App\.main'))
json_text = json.loads(re.search(r'^\s*root\.App\.main\s*=\s*({.*?})\s*;\s*$', script.string, flags=re.MULTILINE).group(1))
print(json_text)
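If the request succeeded, json_text is now a plain Python dict, so a quick way to get oriented is to list its top-level keys before drilling down (the exact layout is Yahoo's internal structure and may change at any time):
print(list(json_text.keys()))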
I have tried parsing the following Pinterest page using urllib, requests, and chromedriver:
https://www.pinterest.com/pin/463237511669606028/
But it looks like some sections of the page are missing from my result. Specifically, I'm trying to parse the number of re-pins (below the comments), which I can't find.
I have tried both of these options, but the userActivity class is not part of what I get:
driver.get("https://www.pinterest.com/pin/463237511669606028/")
html = driver.page_source
soup = BeautifulSoup(html, features="html.parser")
and
req = urllib2.Request("https://www.pinterest.com/pin/463237511669606028/",
                      headers={'User-Agent': "PyBrowser"})
con = urllib2.urlopen(req)
content = con.read()
soup = BeautifulSoup(content,features="html.parser")
Any ideas?
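One likely explanation (not confirmed here) is that the re-pin counter is rendered by JavaScript after the initial page load, so neither urllib2 nor an immediate page_source dump will contain it. A minimal sketch using Selenium's explicit waits; the userActivity class name is taken from the question and may not match Pinterest's current markup:
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from bs4 import BeautifulSoup

driver = webdriver.Chrome()
driver.get("https://www.pinterest.com/pin/463237511669606028/")
# Wait up to 10 seconds for the dynamically rendered section to appear
WebDriverWait(driver, 10).until(
    EC.presence_of_element_located((By.CLASS_NAME, "userActivity")))
soup = BeautifulSoup(driver.page_source, features="html.parser")
driver.quit()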
I have a series of XML files at the HTTPS URL below. I need to get the latest XML file from the URL.
I tried to modify this piece of code, but it does not work. Please help.
from bs4 import BeautifulSoup
import urllib.request
import requests
url = 'https://www.oasis.oati.com/cgi-bin/webplus.dll?script=/woa/woa-planned-outages-report.html&Provider=MISO'
response = requests.get(url, verify=False)
#html = urllib.request.urlopen(url,verify=False)
soup = BeautifulSoup(response)
I suppose BeautifulSoup does not read the response object directly. And if I use the urlopen function, it throws an SSL error.
BeautifulSoup does not understand requests' Response instances directly; grab .content and pass it to the "soup" to parse:
soup = BeautifulSoup(response.content, "html.parser") # you can also use "lxml" or "html5lib" instead of "html.parser"
BeautifulSoup understands "file-like" objects as well, which means that, once you figure out your SSL error, you can do:
data = urllib.request.urlopen(url)
soup = BeautifulSoup(data, "html.parser")
I did not frame my question correctly in the first place. After further research, I found out that I was really trying to extract all the URLs within the anchor tags. With some more background on Beautiful Soup, I would use soup.find_all('a').
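For reference, a minimal sketch of that approach; the assumption that the report links end in .xml and that their filenames sort chronologically is mine, not something stated on the page:
# Collect the target of every anchor tag on the page
links = [a.get('href') for a in soup.find_all('a') if a.get('href')]
# Keep only the XML files and, assuming sortable filenames, take the latest
xml_links = sorted(href for href in links if href.lower().endswith('.xml'))
latest = xml_links[-1] if xml_links else None
print(latest)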
I'm working on my first web crawler and I cannot figure out how to get it to print results. There is no error, but nothing displays.
from bs4 import BeautifulSoup
import urllib3

def extract_links():
    http = urllib3.PoolManager()
    r = http.request('GET', 'http://www.drankbank.com/happy-hour-chicago.html')
    soup = BeautifulSoup(r, 'html.parser')
    print(soup)

extract_links()
Thank you!
You are not accessing the data returned by the request.
soup = BeautifulSoup(r, 'html.parser')
should be:
soup = BeautifulSoup(r.data, 'html.parser')
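Since the function is named extract_links, a small follow-up sketch that actually pulls out the links rather than printing the whole document:
from bs4 import BeautifulSoup
import urllib3

def extract_links():
    http = urllib3.PoolManager()
    r = http.request('GET', 'http://www.drankbank.com/happy-hour-chicago.html')
    soup = BeautifulSoup(r.data, 'html.parser')
    # Print the target of every anchor that has an href attribute
    for a in soup.find_all('a', href=True):
        print(a['href'])

extract_links()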
I have this script :
for url in urls:
    u = urlopen(url).read
    owner_id = re.search(r'ownerId: ([1-9]+)?,', u).group(1)
    id = re.search(r'id: ([1-9]+)?,', u).group(1)
    print(owner_id)
    print(id)
urls is a list of URLs.
The script gives me "TypeError: expected string or bytes-like object".
Do you have any idea how to fix that?
Not sure what version of Python you're using (the below is for Python 3+; for Python 2, replace urllib with urllib2).
You need to import urllib and Beautiful Soup:
import urllib.request
from bs4 import BeautifulSoup

url = "url address"
html = urllib.request.urlopen(url).read()
soup = BeautifulSoup(html, "lxml")
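Worth noting: the traceback in the question almost certainly comes from u = urlopen(url).read, which assigns the bound method itself instead of calling it, so re.search receives a method object rather than text. A minimal sketch of the fix, calling read() and decoding the bytes before applying the regexes (utf-8 is an assumption about the pages' encoding):
from urllib.request import urlopen
import re

for url in urls:
    # Note the parentheses: call read(), then decode bytes to str
    u = urlopen(url).read().decode('utf-8')
    owner_id = re.search(r'ownerId: ([1-9]+)?,', u).group(1)
    page_id = re.search(r'id: ([1-9]+)?,', u).group(1)  # renamed from id to avoid shadowing the built-in
    print(owner_id)
    print(page_id)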