How can I use regex within scrapy? I've searched a lot but could not find any good instructions to follow. I've tried the following, but it throws an exception, which I'll paste below.
import requests, re
from scrapy import Selector

LINK = 'http://www.viperinnovations.com/products-and-services/cableguardian'

def get_item(url):
    res = requests.get(url)
    sel = Selector(res)
    email = re.findall(r'[a-zA-Z0-9_.+-]+@[a-zA-Z0-9-]+\.[a-zA-Z0-9-.]+',sel)[0]
    print(email)

if __name__ == '__main__':
    get_item(LINK)
The exception it throws upon execution:
Traceback (most recent call last):
  File "C:\Users\WCS\AppData\Local\Programs\Python\Python36-32\demo.py", line 13, in <module>
    get_item(LINK)
  File "C:\Users\WCS\AppData\Local\Programs\Python\Python36-32\demo.py", line 9, in get_item
    email = re.findall(r'[a-zA-Z0-9_.+-]+@[a-zA-Z0-9-]+\.[a-zA-Z0-9-.]+',sel)[0]
  File "C:\Users\WCS\AppData\Local\Programs\Python\Python36-32\lib\re.py", line 222, in findall
    return _compile(pattern, flags).findall(string)
TypeError: expected string or bytes-like object
The email within my scraper above is just a placeholder. All I wanna know is how can I use regex within scrapy. Thanks for any help.
A Selector isn't a string, it's an object that knows how to run queries on an HTML string or response object to find sub-elements.
Once you've found the element or elements you want (it will find a list of elements if there are any non-singular queries), the extract method will let you get the text of the found element or elements.
For example:
>>> Selector(text=body)
<Selector (text)>
>>> Selector(text=body).xpath('//span/text()')
<Selector (text) xpath=//span/text()>
>>> Selector(text=body).xpath('//span/text()').extract()
['First span', 'Second span', 'Third span']
It's only the last one you can do anything useful to with a regex:
>>> [match
...  for text in Selector(text=body).xpath('//span/text()').extract()
...  for match in re.findall(r'[a-z]*\s', text)]
['irst ', 'econd ', 'hird ']
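To tie this back to the question: re.findall needs a str (or bytes), so the regex has to run on the page text rather than on the Selector object itself. Here is a minimal sketch using only the standard library. The html string and the address in it are made up and stand in for res.text, and the pattern is modeled on the one in the question.

```python
import re

# Made-up HTML standing in for res.text from requests.get(url)
html = '<p>Contact: <a href="mailto:info@example.com">info@example.com</a></p>'

# Email pattern modeled on the one in the question
EMAIL_RE = re.compile(r'[a-zA-Z0-9_.+-]+@[a-zA-Z0-9-]+\.[a-zA-Z0-9-.]+')


def find_emails(text):
    """Return every email-like match found in a plain string."""
    return EMAIL_RE.findall(text)


emails = find_emails(html)
# findall returns a list that may be empty, so check before indexing
first = emails[0] if emails else None
print(first)
```

Scrapy selectors also have their own .re() method for running a regex on the selected text. It is left out here so that the sketch has no third-party dependencies.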
Related
I am struggling to create one of my first projects in Python 3. When I use the following code:
def scrape_offers():
    r = requests.get("https://www.olx.bg/elektronika/kompyutrni-aksesoari-chasti/aksesoari-chasti/q-1070/?search%5Border%5D=filter_float_price%3Aasc", cookies=all_cookies)
    soup = BeautifulSoup(r.text,"html.parser")
    offers = soup.find_all("div",{'class':'offer-wrapper'})
    for offer in offers:
        offer_name = offer.findChildren("a", {'class':'marginright5 link linkWithHash detailsLink'})
        print(offer_name.text.strip())
I get the following error:
Traceback (most recent call last):
  File "scrape_products.py", line 45, in <module>
    scrape_offers()
  File "scrape_products.py", line 40, in scrape_offers
    print(offer_name.text.strip())
  File "/usr/local/lib/python3.7/site-packages/bs4/element.py", line 2128, in __getattr__
    "ResultSet object has no attribute '%s'. You're probably treating a list of elements like a single element. Did you call find_all() when you meant to call find()?" % key
AttributeError: ResultSet object has no attribute 'text'. You're probably treating a list of elements like a single element. Did you call find_all() when you meant to call find()?
I've read many similar cases on Stack Overflow but I still can't figure it out. If someone has any ideas, please help :)
P.S.: If I run the code without .text, it shows the entire <a class=...> ... </a> element.
findChildren returns a list. Sometimes you get an empty list, sometimes you get a list with one element.
You should add an if statement to check that the returned list contains at least one element, then print the text of the first one.
import requests
from bs4 import BeautifulSoup

def scrape_offers():
    r = requests.get("https://www.olx.bg/elektronika/kompyutrni-aksesoari-chasti/aksesoari-chasti/q-1070/?search%5Border%5D=filter_float_price%3Aasc")
    soup = BeautifulSoup(r.text,"html.parser")
    offers = soup.find_all("div",{'class':'offer-wrapper'})
    for offer in offers:
        offer_name = offer.findChildren("a", {'class':'marginright5 link linkWithHash detailsLink'})
        if (len(offer_name) >= 1):
            print(offer_name[0].text.strip())

scrape_offers()
I'm trying to extract the price of an item in my program by parsing the HTML with the help of the BeautifulSoup ("bs4") library:
import requests
import re
from bs4 import BeautifulSoup
request = requests.get("https://www.aliexpress.com/item/Original-Nokia-Lumia-1020-Nokia-Phone-41MP-Camera-Dual-Core-1-5GHz-32GB-ROM-2-GB/32415650302.html?spm=2114.search0104.3.1.67455f99ocHZOB&ws_ab_test=searchweb0_0,searchweb201602_3_10152_10065_10151_10344_10068_10342_10343_10059_10340_10341_10696_100031_10084_10083_10103_524_10618_10624_10307_10623_10622_10621_10620,searchweb201603_43,ppcSwitch_5&algo_expid=a182685b-0e22-4a88-a7be-6a51dfbeac21-3&algo_pvid=a182685b-0e22-4a88-a7be-6a51dfbeac21&priceBeautifyAB=0")
content = request.content
soup = BeautifulSoup(content,"html.parser")
element = soup.find("span",{"itemprop":"price", "id":"j-sku-price","class":"p-price"},text= not None)
pattern_1 = re.compile("/d+./d+").findall(element).text.strip()
print(pattern_1)
print(element)
And here is what I get as output:
Traceback (most recent call last):
  File "/root/Desktop/Visual_Studio_Files/Python_sample.py", line 9, in <module>
    pattern_1 = (re.compile("/d+./d+").findall(str_ele)).text.strip()
TypeError: expected string or bytes-like object
re.findall freaks out because your element variable has the type bs4.element.Tag.
You can find this out by adding print(type(element)) in your script.
Based on some quick poking around, I think you can extract the string you need from the tag using the contents attribute (which is a list) and taking the first member of this list (index 0).
Moreover, re.findall also returns a list, so instead of .text you need to use [0] to access its first member. Thus you will once again have a string which supports the .strip() method!
Last but not least, it seems you may have mis-typed your slashes and meant to use \ instead of /.
Here's a working version of your code:
pattern_1 = re.findall(r"\d+\.\d+", element.contents[0])[0].strip()
This is definitely not pretty or very pythonic, but it will get the job done.
Note that I dropped the call to re.compile because that gets run in the background when you call re.findall anyway.
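The points above (backslashes instead of forward slashes, and indexing the list that re.findall returns) can be checked without touching the network. A small sketch, where price_text is a made-up stand-in for the text inside the tag. It uses a raw string and escapes the dot so that it only matches a literal decimal point.

```python
import re

# Made-up stand-in for the text inside the price <span>
price_text = "  US $146.00  "

# "/d+./d+" looks for literal slashes and the letter d, so it finds nothing
wrong = re.findall(r"/d+./d+", price_text)

# r"\d+\.\d+" matches digits, a literal dot, then more digits
right = re.findall(r"\d+\.\d+", price_text)

# findall returns a list, so take the first match only if there is one
price = right[0] if right else None

print(wrong)
print(right)
print(price)
```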
Here is what it finally looks like :)
import requests
import re
from bs4 import BeautifulSoup
request = requests.get("https://www.aliexpress.com/item/Original-Nokia-Lumia-1020-Nokia-Phone-41MP-Camera-Dual-Core-1-5GHz-32GB-ROM-2-GB/32415650302.html?spm=2114.search0104.3.1.67455f99ocHZOB&ws_ab_test=searchweb0_0,searchweb201602_3_10152_10065_10151_10344_10068_10342_10343_10059_10340_10341_10696_100031_10084_10083_10103_524_10618_10624_10307_10623_10622_10621_10620,searchweb201603_43,ppcSwitch_5&algo_expid=a182685b-0e22-4a88-a7be-6a51dfbeac21-3&algo_pvid=a182685b-0e22-4a88-a7be-6a51dfbeac21&priceBeautifyAB=0")
content = request.content
soup = BeautifulSoup(content,"html.parser")
element = soup.find("span",{"itemprop":"price", "id":"j-sku-price","class":"p-price"}).text.strip()
# pattern_1 = re.compile("/d+./d+").findall(element)
# print (pattern_1)
print (element)
And this is the output :)
146.00
Thank you, everyone :)
I am trying to learn how to automatically fetch urls from a page. In the following code I am trying to get the title of the webpage:
import urllib.request
import re
url = "http://www.google.com"
regex = r'<title>(,+?)</title>'
pattern = re.compile(regex)
with urllib.request.urlopen(url) as response:
    html = response.read()
title = re.findall(pattern, html)
print(title)
And I get this unexpected error:
Traceback (most recent call last):
  File "path\to\file\Crawler.py", line 11, in <module>
    title = re.findall(pattern, html)
  File "C:\Python33\lib\re.py", line 201, in findall
    return _compile(pattern, flags).findall(string)
TypeError: can't use a string pattern on a bytes-like object
What am I doing wrong?
You want to convert html (a bytes-like object) into a string using .decode, e.g. html = response.read().decode('utf-8').
See Convert bytes to a Python String
The problem is that your regex is a string, but html is bytes:
>>> type(html)
<class 'bytes'>
Since Python doesn't know how those bytes are encoded, it throws an exception when you try to use a string regex on them.
You can either decode the bytes to a string:
html = html.decode('ISO-8859-1') # encoding may vary!
title = re.findall(pattern, html) # no more error
Or use a bytes regex:
regex = rb'<title>(,+?)</title>'
#        ^
In this particular context, you can get the encoding from the response headers:
with urllib.request.urlopen(url) as response:
    encoding = response.info().get_param('charset', 'utf8')
    html = response.read().decode(encoding)
See the urlopen documentation for more details.
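Both options can be tried on a bytes literal, without a network call. In this sketch raw is a made-up page body, and the pattern uses .+? rather than the ,+? from the question, on the assumption that the comma was a typo (,+? would only match commas).

```python
import re

# Made-up response body; response.read() returns bytes like this
raw = b"<html><head><title>Example page</title></head></html>"

# Option 1: decode the bytes, then use a str pattern
html = raw.decode("utf-8")  # assumes the page is UTF-8
titles_str = re.findall(r"<title>(.+?)</title>", html)

# Option 2: keep the bytes and use a bytes pattern (note the b prefix)
titles_bytes = re.findall(rb"<title>(.+?)</title>", raw)

# Mixing a str pattern with bytes input raises the TypeError from the question
try:
    re.findall(r"<title>(.+?)</title>", raw)
    mixed_ok = True
except TypeError:
    mixed_ok = False

print(titles_str)
print(titles_bytes)
print(mixed_ok)
```

Note that option 2 returns bytes objects, so the matches still need decoding before they can be treated as text.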
Based on the last answer, this was simple to do once the PDF had been read:
text = text.decode('ISO-8859-1')
Thanks @Aran-fey
I am a newbie to data scraping. This is the first program I am writing in Python to scrape data and store it in a text file. I have written the following code to scrape the data.
from bs4 import BeautifulSoup
import urllib2
text_file = open("scrape.txt","w")
url = urllib2.urlopen("http://ga.healthinspections.us/georgia/search.cfm?1=1&f=s&r=name&s=&inspectionType=&sd=04/24/2016&ed=05/24/2016&useDate=NO&county=Appling&")
content = url.read()
soup = BeautifulSoup(content, "html.parser")
type = soup.find('span',attrs={"style":"display:inline-block; font- size:10pt;"}).findAll()
for found in type:
    text_file.write(found)
However, when I run this program from the command prompt, it shows the following error.
c:\PyProj\Scraping>python sample1.py
Traceback (most recent call last):
  File "sample1.py", line 9, in <module>
    text_file.write(found)
TypeError: expected a string or other character buffer object
What am I missing here, or is there anything I haven't added? Thanks.
You need to check whether soup.find returned None, i.e. whether it actually found what you searched for, before calling findAll on the result.
Also, don't use the name type; it shadows a builtin.
find returns a single Tag object, while find_all returns a list of them. If you call print on a Tag, you see a string representation, but that conversion isn't invoked automatically by file.write. You have to decide which attribute of found you want to write.
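As a rough illustration of that last point, the sketch below uses a small stand-in class instead of a real bs4 Tag so that it runs without third-party packages. FakeTag and its contents are hypothetical. With a real Tag, found.text and str(found) would play the same roles. It is written for Python 3, where io.StringIO behaves like a file opened in text mode.

```python
import io


class FakeTag:
    """Hypothetical stand-in for a bs4 Tag: has .text and a string form."""

    def __init__(self, name, text):
        self.name = name
        self.text = text

    def __str__(self):
        return "<{0}>{1}</{0}>".format(self.name, self.text)


found_tags = [FakeTag("a", "Inspection 1"), FakeTag("a", "Inspection 2")]

out = io.StringIO()  # stands in for open("scrape.txt", "w")
for found in found_tags:
    # out.write(found) would raise TypeError because found is not a str
    out.write(found.text + "\n")      # write only the text ...
    # out.write(str(found) + "\n")    # ... or the full markup instead

result = out.getvalue()
print(result)
```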
This program is a very simple example of web scraping. The program's goal is to go on the internet, find a specific stock, and then tell the user the price at which the stock is currently trading. However, when I run it, this error message comes up:
Traceback (most recent call last):
  File "HTMLwebscrape.py", line 15, in <module>
    price = re.findall(pattern,htmltext)
  File "C:\Python34\lib\re.py", line 210, in findall
    return _compile(pattern, flags).findall(string)
TypeError: can't use a string pattern on a bytes-like object
Below is the actual script. I've tried finding ways to fix this code, but so far I've been unable to. I've been running this on Python 3 and using Sublime Text as my text editor. Thank you in advance!
import urllib
import re
from urllib.request import Request, urlopen
from urllib.error import URLError

symbolslist = ["AAPL","SPY","GOOG","NFLX"]
i=0
while i < len(symbolslist):
    url = Request("http://finance.yahoo.com/q?s=AAPL&ql=1")
    htmlfile = urlopen(url)
    htmltext = htmlfile.read()
    regex = '<span id="yfs_184_'+symbolslist[i] + '"">(.+?)</span>'
    pattern = re.compile(regex)
    price = re.findall(pattern,htmltext)
    print (price)
    i+=1