I am parsing websites that sell electronic products.
Specifically, I am looking to collect the name and the price of each product.
I ran into a small problem when parsing an XML-based site.
Here is my code:
>>> import urllib2
>>> from bs4 import BeautifulSoup
>>> url=urllib2.urlopen("http://store.explorelabs.com/index.php?main_page=products_all")
>>> soup=BeautifulSoup(url,"xml")
>>> data=soup.find_all(colspan="2")
The code above works.
Now when I do this (as the name is inside the strong tags):
>>> data.strong
or
>>> data.attrs
It shows me this:
Traceback (most recent call last):
File "<pyshell#10>", line 1, in <module>
data.strong
AttributeError: 'ResultSet' object has no attribute 'strong'
or
Traceback (most recent call last):
File "<pyshell#17>", line 1, in <module>
data.find_all('a')
AttributeError: 'ResultSet' object has no attribute 'find_all'
I am trying to iterate over it to find out more.
Any pointers would be very helpful.
find_all returns a list of matching elements, not a single one. Loop over the result set to get at the individual tags:
for element in data:
    element.attrs
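As a minimal sketch of that loop, with invented markup standing in for the real site (the tag names and colspan attribute are taken from the question, the product data is made up):

```python
from bs4 import BeautifulSoup

# Invented markup for illustration; the real site's structure differs.
html = """
<table>
  <tr><td colspan="2"><strong>Widget A</strong> $9.99</td></tr>
  <tr><td colspan="2"><strong>Widget B</strong> $19.99</td></tr>
</table>
"""
soup = BeautifulSoup(html, "html.parser")
data = soup.find_all(colspan="2")   # a ResultSet of matching tags

for element in data:
    # .strong and .attrs work on each individual tag, not on the set itself
    print(element.strong.get_text(), element.attrs)
```

The same applies to chained searches: call element.find_all(...) inside the loop, not data.find_all(...).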
This is the code I'm running.
import bs4
import requests
from bs4 import BeautifulSoup
r=requests.get('https://finance.yahoo.com/quote/SAS.ST/?guccounter=1')
soup=bs4,BeautifulSoup(r.text,"xml")
soup.find_all('div')
And when I run it, the output is:
Traceback (most recent call last):
  File "/Users/darre/Desktop/script.py3", line 8, in <module>
    bi=soup.find_all('div')
AttributeError: 'tuple' object has no attribute 'find_all'
Here is the error: the comma in soup=bs4,BeautifulSoup(r.text,"xml") makes soup a tuple of (the bs4 module, the soup object), and tuples have no find_all. Use
soup = BeautifulSoup(r.text, "lxml")
instead. BeautifulSoup supports several different parsers; see the parser section of its documentation for details.
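A minimal sketch of the tuple trap, using a trivial snippet of HTML rather than the live Yahoo page:

```python
import bs4
from bs4 import BeautifulSoup

html = "<div>hello</div>"

# The stray comma builds a 2-tuple: (the bs4 module, a BeautifulSoup object).
broken = bs4, BeautifulSoup(html, "html.parser")
print(type(broken))          # <class 'tuple'>

# The intended call assigns the soup object itself:
soup = BeautifulSoup(html, "html.parser")
print(soup.find_all("div"))  # [<div>hello</div>]
```

Any comma on the right-hand side of an assignment creates a tuple in Python, which is why the error mentions 'tuple' rather than anything about parsing.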
Running the example from the pip page, modified to look up the 'chrome' browser, I get a KeyError.
script:
import browserhistory as bh
dict_obj = bh.get_browserhistory()
data = dict_obj.keys()
print(data)
data = dict_obj['chrome'][0]
print(data)
output:
dict_keys([])
Traceback (most recent call last):
File "/home/raj/Desktop/anilyzer/myproject/main.py", line 6, in
data = dict_obj['chrome'][0]
KeyError: 'chrome'
What is happening?
It doesn't look like the module is maintained, and it cannot detect your browser(s): https://github.com/kcp18/browserhistory
The function bh.get_browserhistory() should return a dictionary where the type of browser is the lookup key. Displaying this as you do shows that the dictionary is empty.
>>> data = dict_obj.keys()
>>> print(data)
dict_keys([])
This is why you get a KeyError when attempting to read a specific key from the dict:
>>> d = {}
>>> type(d)
<class 'dict'>
>>> d["something"]
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
KeyError: 'something'
However, there appears to be a different project which may do what you want: https://github.com/pesos/browser-history
This is from reading the GitHub Issues: https://github.com/kcp18/browserhistory/issues and I cannot comment on its quality or safety!
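Independent of which library you end up with, dict.get is the usual way to probe for a key that may be absent instead of letting a bare lookup raise KeyError. A sketch, with an empty dict standing in for what bh.get_browserhistory() returned here:

```python
history = {}  # stands in for the empty dict bh.get_browserhistory() returned

# A bare lookup raises KeyError when the key is missing:
try:
    history['chrome'][0]
except KeyError:
    print("no chrome history found")

# .get returns a default instead of raising:
rows = history.get('chrome', [])
print(len(rows))   # 0
```

Checking `if 'chrome' in history:` before indexing is an equivalent guard.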
I can successfully extract the response as JSON. However, I am unable to list or extract the keys and values I need.
Below is my code:
import requests
response = requests.get("https://www.woolworths.com.au/apis/ui/Product/Specials/half-price?GroupID=948&isMobile=false&pageNumber=1&pageSize=36&richRelevanceId=SP_948&sortType=Personalised")
data = response.json()
I tried data['Stockcode'], but no luck, and also data['Product'].
It says:
>>> data['Product']
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
KeyError: 'Product'
>>> data['Products']
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
KeyError: 'Products'
Try:
>>> data['Items'][0]['Products']
Print data and inspect its structure to see how it is constructed; then you can extract the values you need.
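A sketch of that exploration, level by level. The sample payload below is a guess shaped like the answer describes ("Items" wrapping "Products"); the live API response will contain far more fields:

```python
import json

# Invented sample shaped like the answer's path data['Items'][0]['Products'].
data = json.loads("""
{
  "Items": [
    {"Products": [{"Stockcode": 123, "Name": "Sample product"}]}
  ]
}
""")

# Inspect the keys at each level before indexing into the structure.
print(list(data.keys()))                            # ['Items']
print(list(data["Items"][0].keys()))                # ['Products']
print(data["Items"][0]["Products"][0]["Stockcode"]) # 123
```

Printing keys() at each level like this is a quick way to discover why data['Product'] raised KeyError: the top level simply has no such key.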
Need help with an adidas auto checkout script. Getting the following error:
Traceback (most recent call last):
File "adidas.py", line 169, in <module>
checkout()
File "adidas.py", line 80, in checkout
url = soup.find('div', {'class': 'cart_wrapper rbk_shadow_angle rbk_wrapper_checkout summary_wrapper'})['data-url']
TypeError: 'NoneType' object is not subscriptable
Link to the entire script: https://github.com/kfichter/OpenATC/blob/482360a7a160136a4969d2cf0527809660d021fb/Scripts/adidas.py
soup.find() is returning None. You are trying to look up the key 'data-url' in this result, but None does not support key lookup.
Depending on what you're trying to do, you should either change the query so it doesn't return None, or check that the value is not None before trying to access the 'data-url' key.
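A minimal sketch of the second option, guarding against find() returning None. The class string is copied from the question; the markup here is deliberately missing the target div so the guard fires:

```python
from bs4 import BeautifulSoup

html = "<div class='other'>no cart here</div>"  # markup without the target div
soup = BeautifulSoup(html, "html.parser")

wrapper = soup.find('div', {'class': 'cart_wrapper rbk_shadow_angle '
                                     'rbk_wrapper_checkout summary_wrapper'})
if wrapper is not None:
    url = wrapper['data-url']
else:
    # find() matched nothing; bail out instead of subscripting None
    url = None
    print("cart wrapper not found")
```

In a scraper like this, find() returning None often means the site changed its markup, so logging the miss is more useful than letting the TypeError propagate.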
I'm trying to integrate Pagseguro (a Brazilian payment service, similar to PayPal) using this lib:
https://github.com/rochacbruno/python-pagseguro
But I don't know how to access the data from the notification that the service sends to me. This is my code:
notification_code = request.POST['notificationCode']
pg = PagSeguro(email="testPerson@gmail.com", token="token")
notification_data = pg.check_notification(notification_code)
print notification_data['status']
On the last line I receive this error:
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
TypeError: 'PagSeguroNotificationResponse' object has no attribute '__getitem__'
The documentation in the README doesn't seem to match the code. It looks like notification_data is not a dictionary but an object whose attributes match the dictionary keys from the README.
So this should work if you just change print notification_data['status'] to the following:
print notification_data.status
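To illustrate the difference, here is a sketch with a stand-in class (the real PagSeguroNotificationResponse comes from the library; the field value is invented):

```python
class NotificationResponse:
    """Stand-in for PagSeguroNotificationResponse: data lives in attributes."""
    def __init__(self, status):
        self.status = status

notification_data = NotificationResponse(status="paid")

# Attribute access works; item access like notification_data['status']
# would raise TypeError, since the class defines no __getitem__.
print(notification_data.status)                    # paid

# getattr gives the same dict.get-style fallback for attributes:
print(getattr(notification_data, 'status', None))  # paid
```

The TypeError about '__getitem__' is Python 2's way of saying the object does not support the obj[key] syntax at all.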