I am struggling creating one of my first projects on python3. When I use the following code:
def scrape_offers():
r = requests.get("https://www.olx.bg/elektronika/kompyutrni-aksesoari-chasti/aksesoari-chasti/q-1070/?search%5Border%5D=filter_float_price%3Aasc", cookies=all_cookies)
soup = BeautifulSoup(r.text,"html.parser")
offers = soup.find_all("div",{'class':'offer-wrapper'})
for offer in offers:
offer_name = offer.findChildren("a", {'class':'marginright5 link linkWithHash detailsLink'})
print(offer_name.text.strip())
I get the following error:
Traceback (most recent call last):
File "scrape_products.py", line 45, in <module>
scrape_offers()
File "scrape_products.py", line 40, in scrape_offers
print(offer_name.text.strip())
File "/usr/local/lib/python3.7/site-packages/bs4/element.py", line 2128, in __getattr__
"ResultSet object has no attribute '%s'. You're probably treating a list of elements like a single element. Did you call find_all() when you meant to call find()?" % key
AttributeError: ResultSet object has no attribute 'text'. You're probably treating a list of elements like a single element. Did you call find_all() when you meant to call find()?
I've read many similar cases on StackOverFlow but I still can't help myself. If someone have any ideas, please help :)
P.S.: If i run the code without .text it show the entire <a class=...> ... </a>
findchildren returns a list. Sometimes you get an empty list, sometimes you get a list with one element.
You should add an if statement to check if the length of the returned list is greater than 1, then print the text.
import requests
from bs4 import BeautifulSoup
def scrape_offers():
r = requests.get("https://www.olx.bg/elektronika/kompyutrni-aksesoari-chasti/aksesoari-chasti/q-1070/?search%5Border%5D=filter_float_price%3Aasc")
soup = BeautifulSoup(r.text,"html.parser")
offers = soup.find_all("div",{'class':'offer-wrapper'})
for offer in offers:
offer_name = offer.findChildren("a", {'class':'marginright5 link linkWithHash detailsLink'})
if (len(offer_name) >= 1):
print(offer_name[0].text.strip())
scrape_offers()
Related
I am making a program for web scraping but this is my first time. The tutorial that I am using is built for python 2.7, but I am using 3.8.2. I have mostly edited my code to fit it to python 3, but one error pops up and I can't fix it.
import requests
import csv
from bs4 import BeautifulSoup
url = 'http://www.showmeboone.com/sheriff/JailResidents/JailResidents.asp'
response = requests.get(url)
html = response.content
soup = BeautifulSoup(features="html.parser")
results_table = soup.find('table', attrs={'class': 'resultsTable'})
output = []
for row in results_table.findAll('tr'):
output_rows = []
for cell in tr.findAll('td'):
output_rows.append(cell.text.replace(' ', ''))
output.append(output_rows)
print(output)
handle = open('out-using-requests.csv', 'a')
outfile = csv.writer(handle)
outfile.writerows(output)
The error I get is:
Traceback (most recent call last):
File "C:\Code\scrape.py", line 17, in <module>
for row in results_table.findAll('tr'):
AttributeError: 'NoneType' object has no attribute 'findAll'
The tutorial I am using is https://first-web-scraper.readthedocs.io/en/latest/
I tried some other questions, but they didn't help.
Please help!!!
Edit: Never mind, I got a good answer.
find returns None if it doesn't find a match. You need to check for that before attempting to find any sub elements in it:
results_table = soup.find('table', attrs={'class': 'resultsTable'})
output = []
if results_table:
for row in results_table.findAll('tr'):
output_rows = []
for cell in tr.findAll('td'):
output_rows.append(cell.text.replace(' ', ''))
output.append(output_rows)
The error allows the following conclusion:
results_table = None
Therefore, you cannot access the findAll() method because None.findAll() does not exist.
You should take a look, it is best to use a debugger to run through your program and see how the variables change line by line and why the mentioned line only returns ```None''. Especially important is the line:
results_table = soup.find('table', attrs={'class': 'resultsTable'})
Because in this row results_table is initialized yes, so here the above none'' value is returned andresults_table'' is assigned.
I'm trying to extract the price of the item from my programme by parsing the HTML with help of "bs4" BeautifulSoup library
import requests
import re
from bs4 import BeautifulSoup
request = requests.get("https://www.aliexpress.com/item/Original-Nokia-Lumia-1020-Nokia-Phone-41MP-Camera-Dual-Core-1-5GHz-32GB-ROM-2-GB/32415650302.html?spm=2114.search0104.3.1.67455f99ocHZOB&ws_ab_test=searchweb0_0,searchweb201602_3_10152_10065_10151_10344_10068_10342_10343_10059_10340_10341_10696_100031_10084_10083_10103_524_10618_10624_10307_10623_10622_10621_10620,searchweb201603_43,ppcSwitch_5&algo_expid=a182685b-0e22-4a88-a7be-6a51dfbeac21-3&algo_pvid=a182685b-0e22-4a88-a7be-6a51dfbeac21&priceBeautifyAB=0")
content = request.content
soup = BeautifulSoup(content,"html.parser")
element = soup.find("span",{"itemprop":"price", "id":"j-sku-price","class":"p-price"},text= not None)
pattern_1 = re.compile("/d+./d+").findall(element).text.strip()
print(pattern_1)
print(element)
and here is what I get as output :
Traceback (most recent call last):
File "/root/Desktop/Visual_Studio_Files/Python_sample.py", line 9, in <module>
pattern_1 = (re.compile("/d+./d+").findall(str_ele)).text.strip()
TypeError: expected string or bytes-like object
re.findall freaks out because your element variable has the type bs4.element.Tag.
You can find this out by adding print(type(element)) in your script.
Based on some quick poking around, I think you can extract the string you need from the tag using the contents attribute (which is a list) and taking the first member of this list (index 0).
Moreover, re.findall also returns a list, so instead of .text you need to use [0] to access its first member. Thus you will once again have a string which supports the .strip() method!
Last but not least, it seems you may have mis-typed your slashes and meant to use \ instead of /.
Here's a working version of your code:
pattern_1 = re.findall("\d+.\d+", element.contents[0])[0].strip()
This is definitely not pretty or very pythonic, but it will get the job done.
Note that I dropped the call to re.compile because that gets run in the background when you call re.findall anyway.
here is what it finally look like :)
import requests
import re
from bs4 import BeautifulSoup
request = requests.get("https://www.aliexpress.com/item/Original-Nokia-Lumia-1020-Nokia-Phone-41MP-Camera-Dual-Core-1-5GHz-32GB-ROM-2-GB/32415650302.html?spm=2114.search0104.3.1.67455f99ocHZOB&ws_ab_test=searchweb0_0,searchweb201602_3_10152_10065_10151_10344_10068_10342_10343_10059_10340_10341_10696_100031_10084_10083_10103_524_10618_10624_10307_10623_10622_10621_10620,searchweb201603_43,ppcSwitch_5&algo_expid=a182685b-0e22-4a88-a7be-6a51dfbeac21-3&algo_pvid=a182685b-0e22-4a88-a7be-6a51dfbeac21&priceBeautifyAB=0")
content = request.content
soup = BeautifulSoup(content,"html.parser")
element = soup.find("span",{"itemprop":"price", "id":"j-sku-price","class":"p-price"}).text.strip()
# pattern_1 = re.compile("/d+./d+").findall(element)
# print (pattern_1)
print (element)
and this is the output :)
146.00
thank you every one :)
After cruising through about a dozen questions of the same variety, and consulting a coworker, I have determined I need some expert insight
with open("c:\source\list.csv") as f:
for row in csv.reader(f):
for url in row:
r = requests.get(url)
soup = BeautifulSoup(r.content, 'lxml')
tables = soup.find('table', attrs={"class": "hpui-standardHrGrid-table"}).append
for rows in table.find_all('tr', {'releasetype': 'Current_Releases'}):
item = [].append
for val in row.find_all('td'):
item.append(val.text.encode('utf8').strip())
rows.append(item)
headers = [header.text for header in tables.find_all('th')].append
rows = [].append
print (headers)
So what I have here is : a csv file that has 30 URLs in it. I first dump them into Soup to get all of its contents, then bind the specific HTML element (the tables) to the tables variable. After this, i am trying to pull specific rows and headers from those tables.
According to logical thinking in my brain, it should work, but instead, i get this:
Traceback (most recent call last):
File "<stdin>", line 7, in <module>
AttributeError: 'function' object has no attribute 'find_all'
Line 7 is
for rows in table.find_all('tr', {'releasetype': 'Current_Releases'}):
What are we missing here?
You have some strange misconceptions about Python syntax. Four times in your code you refer to <something>.append; I'm not sure what you think this does, but append is a method and it not only must be called, with (), but it needs a parameter: the thing you are appending.
So, for example, this line:
item = [].append
makes no sense at all; what are you expecting item to be? What are you hoping to append? Surely you just mean item = [].
In the specific case, the error is because of the superfluous append on the end of the previous line:
tables = soup.find('table', attrs={"class": "hpui-standardHrGrid-table"}).append
Again, just remove the append.
I am newbie to data scraping. This is my first program i am writting in python to scrape data and store it into the text file. I have written following code to scrape the data.
from bs4 import BeautifulSoup
import urllib2
text_file = open("scrape.txt","w")
url = urllib2.urlopen("http://ga.healthinspections.us/georgia/search.cfm?1=1&f=s&r=name&s=&inspectionType=&sd=04/24/2016&ed=05/24/2016&useDate=NO&county=Appling&")
content = url.read()
soup = BeautifulSoup(content, "html.parser")
type = soup.find('span',attrs={"style":"display:inline-block; font- size:10pt;"}).findAll()
for found in type:
text_file.write(found)
However i run this program using command prompt it shows me following error.
c:\PyProj\Scraping>python sample1.py
Traceback (most recent call last):
File "sample1.py", line 9, in <module>
text_file.write(found)
TypeError: expected a string or other character buffer object
What am i missing here, or is there anything i haven't added to. Thanks.
You need to check if type is None, ie soup.find did not actually find what you searched.
Also, don't use the name type, it's a builtin.
find, much like find_all return one/a list of Tag object(s). If you call print on a Tag you see a string representation. This automatism isn;t invoked on file.write. You have to decide what attribute of found you want to write.
I have next code:
for table in soup.findAll("table","tableData"):
for row in table.findAll("tr"):
data = row.findAll("td")
url = data[0].a
print type(url)
I get next output:
<class 'bs4.element.Tag'>
That means, that url is object of class Tag and i could get attribytes from this objects.
But if i replace print type(url) to print url['href'] i get next traceback
Traceback (most recent call last):
File "baseCreator.py", line 57, in <module>
createStoresTable()
File "baseCreator.py", line 46, in createStoresTable
print url['href']
TypeError: 'NoneType' object has no attribute '__getitem__'
What is wrong? And how i can get value of href attribute.
I do like BeautifulSoup but I personally prefer lxml.html (for not too wacky HTML) because of the ability to utilise XPath.
import lxml.html
page = lxml.html.parse('http://somesite.tld')
print page.xpath('//tr/td/a/#href')
Might need to implement some form of "axes" though depending on the structure.
You can also use elementsoup as a parser - details at http://lxml.de/elementsoup.html