BeautifulSoup responses with error - python

I am trying to get my feet wet with BS.
I tried to work my way through the documentation, but at the very first step I already ran into a problem.
This is my code:
from bs4 import BeautifulSoup
soup = BeautifulSoup('https://api.flickr.com/services/rest/?method=flickr.photos.search&api_key=5....1b&per_page=250&accuracy=1&has_geo=1&extras=geo,tags,views,description')
print(soup.prettify())
This is the response I get:
Warning (from warnings module):
File "/Library/Frameworks/Python.framework/Versions/3.3/lib/python3.3/site-packages/bs4/__init__.py", line 189
'"%s" looks like a URL. Beautiful Soup is not an HTTP client. You should probably use an
HTTP client to get the document behind the URL, and feed that document to Beautiful Soup.' % markup)
UserWarning: "https://api.flickr.com/services/rest/?method=flickr.photos.search&api_key=5...b&per_page=250&accuracy=1&has_geo=1&extras=geo,tags,views,description"
looks like a URL. Beautiful Soup is not an HTTP client. You should
probably use an HTTP client to get the document behind the URL, and feed that document
to Beautiful Soup.
Is it because I try to call https rather than plain http, or is it another problem?
Thanks for your help!

You are passing the URL as a string. Instead, you need to fetch the page source first, via urllib2 or requests:
from urllib2 import urlopen # for Python 3: from urllib.request import urlopen
from bs4 import BeautifulSoup
URL = 'https://api.flickr.com/services/rest/?method=flickr.photos.search&api_key=5....1b&per_page=250&accuracy=1&has_geo=1&extras=geo,tags,views,description'
soup = BeautifulSoup(urlopen(URL))
Note that you don't need to call read() on the result of urlopen(); BeautifulSoup accepts a file-like object as its first argument, and urlopen() returns exactly that.
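For completeness, the same fetch with requests, as a minimal sketch (YOUR_KEY is a placeholder for the api_key value elided in the question; the "xml" parser assumes lxml is installed, which fits the XML that the Flickr REST API returns):
import requests
from bs4 import BeautifulSoup
# YOUR_KEY is a placeholder; the real key is elided in the question.
URL = ('https://api.flickr.com/services/rest/?method=flickr.photos.search'
       '&api_key=YOUR_KEY&per_page=250&accuracy=1&has_geo=1'
       '&extras=geo,tags,views,description')
response = requests.get(URL)
response.raise_for_status()  # fail loudly on HTTP errors
soup = BeautifulSoup(response.content, "xml")  # Flickr REST returns XML
print(soup.prettify())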

The error says it all: you are passing a URL to Beautiful Soup. You need to fetch the website content first, and only then pass that content to BS.
To download the content you can use urllib2:
import urllib2
response = urllib2.urlopen('http://www.example.com/')
html = response.read()
and later
soup = BeautifulSoup(html)

Related

Read data from URL / XML with python

This is my first question. I'm trying to learn some Python, and I've run into this problem:
how can I get data from this URL, which serves its info as XML?
import requests
from bs4 import BeautifulSoup
url = 'http://windte1910.acepta.com/v01/A23D046FC1854B18399D5383F36923E25774179C?k=5121f909fd63e674149c0e42a9847b49'
document = requests.get(url)
soup = BeautifulSoup(document.content, "lxml-xml")
print(soup)
This prints the whole XML document, but I want to access a specific piece of data, the <RUTEmisor> element for example.
I hope you guys can advise me on the code and on how to read XML docs.
By examining the URL you gave, it seems that the data is actually held a few links away at the following URL: http://windte1910.acepta.com/depot/A23D046FC1854B18399D5383F36923E25774179C?k=5121f909fd63e674149c0e42a9847b49
As such, you can access it directly as follows:
import requests
from bs4 import BeautifulSoup
url = 'http://windte1910.acepta.com/depot/A23D046FC1854B18399D5383F36923E25774179C?k=5121f909fd63e674149c0e42a9847b49'
document = requests.get(url)
soup = BeautifulSoup(document.content, "lxml-xml")
print(soup.find('RUTEmisor').text)
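If you want to see every field the document exposes before picking one, a quick illustrative sketch that walks the soup built above:
for tag in soup.find_all(True):  # True matches every tag
    if tag.string:  # only tags with direct text content
        print(tag.name, '->', tag.string.strip())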

Problem finding elements by class with beautiful soup

I am trying to get the names of the events on this page using Beautiful Soup 4: https://www.orbitxch.com/customer/sport/1
I tried to filter the HTML for tags with class="biab_item-link biab_market-link js-event-link biab_has-time", as they seemed to be the ones containing each unique event name once.
Here is my code
from bs4 import BeautifulSoup
import urllib3
http = urllib3.PoolManager()
url = 'https://www.orbitxch.com/customer/sport/1'
response = http.request('GET', url)
soup = BeautifulSoup(response.data, features="lxml")
for tag in soup.find_all("a", class_="biab_item-link biab_market-link js-event-link biab_has-time"):
print(tag["title"])
But nothing happens.
That's because the HTML content is dynamically changed by JavaScript. The data comes from this URL: https://www.orbitxch.com/customer/api/event-updates?eventIds=29108154,29106937,29096310,29096315,29106936,29096313,29096309,29096306,29107821,29108318,29106488,29106934,29106830,29106490,29104420 but honestly I don't know where you can find these IDs. That URL returns a JSON response, which you can easily parse with a Python library.
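A minimal sketch of fetching and parsing that endpoint (the eventIds below are copied from the URL above; as noted, how they are generated is unclear):
import json
import requests
url = 'https://www.orbitxch.com/customer/api/event-updates?eventIds=29108154,29106937,29096310'
data = requests.get(url).json()  # parse the JSON body into Python dicts/lists
print(json.dumps(data, indent=2)[:1000])  # inspect the payload structure first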

Get the latest XML file from a HTTPS

I have a series of XML files at the HTTPS URL below, and I need to get the latest XML file from it.
I tried to modify this piece of code, but it does not work. Please help.
from bs4 import BeautifulSoup
import urllib.request
import requests
url = 'https://www.oasis.oati.com/cgi-bin/webplus.dll?script=/woa/woa-planned-outages-report.html&Provider=MISO'
response = requests.get(url, verify=False)
#html = urllib.request.urlopen(url,verify=False)
soup = BeautifulSoup(response)
I suppose BeautifulSoup does not read the Response object. And if I use the urlopen function, it throws an SSL error.
BeautifulSoup does not understand requests' Response instances directly; grab .content and pass it to the soup to parse:
soup = BeautifulSoup(response.content, "html.parser") # you can also use "lxml" or "html5lib" instead of "html.parser"
BeautifulSoup understands file-like objects as well, which means that once you figure out your SSL error you can do:
data = urllib.request.urlopen(url)
soup = BeautifulSoup(data, "html.parser")
I did not frame my question correctly in the first place. After further research, I found that I was really trying to extract all the URLs inside the page's anchor tags. With some more Beautiful Soup background, I would use soup.find_all('a').
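A minimal sketch of that approach, assuming the report page exposes the XML files as ordinary <a href> links ending in .xml (both assumptions should be checked against the live page):
import requests
from bs4 import BeautifulSoup
url = 'https://www.oasis.oati.com/cgi-bin/webplus.dll?script=/woa/woa-planned-outages-report.html&Provider=MISO'
response = requests.get(url, verify=False)  # verify=False only as a stopgap, as above
soup = BeautifulSoup(response.content, "html.parser")
# collect every link target, then keep the ones that look like XML files
hrefs = [a["href"] for a in soup.find_all("a", href=True)]
xml_links = [h for h in hrefs if h.lower().endswith(".xml")]  # assumption: files end in .xml
print(xml_links)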

Scraping website in which html is injected with javascript

I am trying to get the url and sneaker titles at https://stockx.com/sneakers.
This is my code so far:
in main.py
from bs4 import BeautifulSoup
from utils import generate_request_header
import requests
url = "https://stockx.com/sneakers"
html = requests.get(url, headers=generate_request_header()).content
soup = BeautifulSoup(html, "lxml")
print soup
in utils.py
import random
def generate_request_header():
    header = BASE_REQUEST_HEADER  # defined elsewhere in utils.py
    header["User-Agent"] = random.choice(USER_AGENT_HEADER_LIST)
    return header
But whenever I print soup, I get the following output: https://pastebin.com/Ua6B6241. There doesn't seem to be any HTML extracted. How would I get it? Should I be using something like Selenium?
requests doesn't seem to be able to verify the SSL certificates; to temporarily bypass this error, you can use verify=False, i.e.:
requests.get(url, headers=generate_request_header(), verify=False)
To fix it permanently, you may want to read:
http://docs.python-requests.org/en/master/user/advanced/#ssl-cert-verification
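The gist of that page, as a hedged sketch: rather than disabling verification, point requests at a trusted CA bundle (the path below is a placeholder; use your system bundle or the one shipped with certifi):
import requests
url = "https://stockx.com/sneakers"
# Placeholder path; substitute your actual CA bundle location.
response = requests.get(url, verify="/path/to/ca-bundle.crt")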
I'm guessing the data you're looking for is at line 126 of the pastebin. I've never tried to extract the text of a script, but I'm sure it can be done.
In lxml, something like:
source_code.xpath('//script[@type="text/javascript"]') should return a list of all the script elements as objects.
Or, to try to get straight to the "tickers":
[i for i in source_code.xpath('//script[@type="text/javascript"]') if 'tickers' in i.xpath('string()')]
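The same filtering can be done with BeautifulSoup directly. A minimal sketch, assuming html already holds the page source as in the question's main.py:
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, "lxml")
scripts = soup.find_all("script", type="text/javascript")
# keep only inline scripts whose source text mentions 'tickers'
ticker_scripts = [s for s in scripts if s.string and "tickers" in s.string]
for s in ticker_scripts:
    print(s.string[:200])  # peek at the first 200 characters of each match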

BeautifulSoup 'AttributeError'

I'm getting "AttributeError: 'NoneType' object has no attribute 'string'" when I run the following. However, when the same tasks are performed on a block string variable, it works.
Any ideas as to what I'm doing wrong?
from BeautifulSoup import BeautifulSoup
from urllib import urlopen
url = ("https://en.wikipedia.org/w/api.php?action=query&prop=extracts&titles=Albert%20Einstein&explaintext")
print BeautifulSoup(urlopen(url).read()).find('extract').string.split("\n", 1)[0]
from BeautifulSoup import BeautifulSoup
from urllib import urlopen
url = ("https://en.wikipedia.org/w/api.php?action=query&prop=extracts&titles=Albert%20Einstein&explaintext")
soup = BeautifulSoup(urlopen(url).read())
print soup.find('extract') # returns None
The find method is not finding anything with the tag 'extract'. If you want to see it work, give it an HTML tag that exists in the document, like 'pre' or 'html'.
'extract' looks like an XML tag. You might want to try reading the Beautiful Soup documentation on parsing XML: http://www.crummy.com/software/BeautifulSoup/bs3/documentation.html#Parsing XML. Also, there is a new version of Beautiful Soup out there (bs4); I find its API much nicer.
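With bs4 and an XML parser, that lookup could look like the sketch below (assumptions: format=xml is appended so the API returns XML rather than its default format, and the "xml" parser requires lxml):
import requests
from bs4 import BeautifulSoup
url = ('https://en.wikipedia.org/w/api.php?action=query&prop=extracts'
       '&titles=Albert%20Einstein&explaintext&format=xml')
soup = BeautifulSoup(requests.get(url).content, "xml")
extract = soup.find('extract')
if extract is not None:  # guard against the NoneType error above
    print(extract.get_text().split("\n", 1)[0])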
