This program is a very simple example of web scraping. The program's goal is to go to the internet, find a specific stock, and tell the user the price that the stock is currently trading at. However, when I run it, this error message comes up:
Traceback (most recent call last):
File "HTMLwebscrape.py", line 15, in <module>
price = re.findall(pattern,htmltext)
File "C:\Python34\lib\re.py", line 210, in findall
return _compile(pattern, flags).findall(string)
TypeError: can't use a string pattern on a bytes-like object
Below is the actual script. I've tried to find a way to fix this error, but so far I've been unable to. I've been running this on Python 3 and using Sublime Text as my text editor. Thank you in advance!
import urllib
import re
from urllib.request import Request, urlopen
from urllib.error import URLError
symbolslist = ["AAPL","SPY","GOOG","NFLX"]
i=0
while i < len(symbolslist):
url = Request("http://finance.yahoo.com/q?s=AAPL&ql=1")
htmlfile = urlopen(url)
htmltext = htmlfile.read()
regex = '<span id="yfs_184_'+symbolslist[i] + '"">(.+?)</span>'
pattern = re.compile(regex)
price = re.findall(pattern,htmltext)
print (price)
i+=1
I'm creating a crawler in Python to list all the links on a website, but I'm getting an error and I can't see what causes it.
The error is:
Traceback (most recent call last):
File "vul_scanner.py", line 8, in <module>
vuln_scanner.crawl(target_url)
File "C:\Users\Lenovo x240\Documents\website\website\spiders\scanner.py", line 18, in crawl
href_links= self.extract_links_from(url)
File "C:\Users\Lenovo x240\Documents\website\website\spiders\scanner.py", line 15, in extract_links_from
return re.findall('(?:href=")(.*?)"', response.content)
File "C:\Users\Lenovo x240\AppData\Local\Programs\Python\Python38\lib\re.py", line 241, in findall
return _compile(pattern, flags).findall(string)
TypeError: cannot use a string pattern on a bytes-like object
My code, in the scanner.py file, is:
# To ignore numpy errors:
# pylint: disable=E1101

import urllib
import requests
import re
from urllib.parse import urljoin

class Scanner:
    def __init__(self, url):
        self.target_url = url
        self.target_links = []

    def extract_links_from(self, url):
        response = requests.get(url)
        return re.findall('(?:href=")(.*?)"', response.content)

    def crawl(self, url):
        href_links = self.extract_links_from(url)
        for link in href_links:
            link = urljoin(url, link)
            if "#" in link:
                link = link.split("#")[0]
            if self.target_url in link and link not in self.target_links:
                self.target_links.append(link)
                print(link)
                self.crawl(link)
In the vul_scanner.py file:
import scanner
# To ignore numpy errors:
# pylint: disable=E1101
target_url = "https://www.amazon.com"
vuln_scanner = scanner.Scanner(target_url)
vuln_scanner.crawl(target_url)
The command I run is: python vul_scanner.py
return re.findall('(?:href=")(.*?)"', response.content)
response.content in this case is of type bytes. So either use response.text, which gives you plain text that you can process as you plan to now, or you can check this out:
Regular expression parsing a binary file?
In case you want to continue down the binary road.
Cheers
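For the text route, a minimal sketch of the fix (only extract_links_from changes; the rest of the posted Scanner class stays as it is):

    def extract_links_from(self, url):
        response = requests.get(url)
        # response.text is a decoded str, so the string pattern works;
        # response.content is bytes and raises the TypeError above
        return re.findall('(?:href=")(.*?)"', response.text)

For the binary road instead, the pattern itself would have to be bytes, e.g. re.findall(b'(?:href=")(.*?)"', response.content).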
I am using the following code in an attempt to do web scraping.
import sys, os
import requests, webbrowser, bs4
from PIL import Image
import pyautogui

p = requests.get('http://www.goal.com/en-ie/news/ozil-agent-eviscerates-jealous-keown-over-stupid-comments/1javhtwzz72q113dnonn24mnr1')
n = open("exml.txt", 'wb')
for i in p.iter_content(1000):
    n.write(i)
n.close()

n = open("exml.txt", 'r')
soupy = bs4.BeautifulSoup(n, "html.parser")
elems = soupy.select('img[src]')
for u in elems:
    print(u)
What I intend to do is extract all the image links from the HTML response obtained from the page.
(Please correct me if I'm wrong in thinking that requests.get returns the whole static HTML file of the webpage that opens on entering the URL.)
However, on the line:
soupy= bs4.BeautifulSoup(n,"html.parser")
I am getting the following error:
Traceback (most recent call last):
File "../../perl/webscratcher.txt", line 24, in <module>
soupy= bs4.BeautifulSoup(n,"html.parser")
File "C:\Users\Kanishc\AppData\Local\Programs\Python\Python36-32\lib\site-packages\bs4\__init__.py", line 191, in __init__
markup = markup.read()
File "C:\Users\Kanishc\AppData\Local\Programs\Python\Python36-32\lib\encodings\cp1252.py", line 23, in decode
return codecs.charmap_decode(input,self.errors,decoding_table)[0]
UnicodeDecodeError: 'charmap' codec can't decode byte 0x81 in position 24662: character maps to <undefined>
I am clueless about the error, and the "AppData" folder is empty.
How do I proceed?
After trying the suggestions:
I changed the file extension to .py and that error went away. However, on the following line:
soupy= bs4.BeautifulSoup(n,"lxml")
I am getting the following error:
Traceback (most recent call last):
File "C:\perl\webscratcher.py", line 23, in <module>
soupy= bs4.BeautifulSoup(p,"lxml")
File "C:\Users\PREMRAJ\AppData\Local\Programs\Python\Python36-32\lib\site-packages\bs4\__init__.py", line 192, in __init__
elif len(markup) <= 256 and (
TypeError: object of type 'Response' has no len()
How do I tackle this?
You are over-complicating things. Pass the bytes content of a Response object directly into the constructor of the BeautifulSoup object, instead of writing it to a file.
import requests
from bs4 import BeautifulSoup
response = requests.get('http://www.goal.com/en-ie/news/ozil-agent-eviscerates-jealous-keown-over-stupid-comments/1javhtwzz72q113dnonn24mnr1')
soup = BeautifulSoup(response.content, 'lxml')
for element in soup.select('img[src]'):
    print(element)
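(If you prefer to let requests handle the decoding, response.text works just as well here, e.g. BeautifulSoup(response.text, 'lxml').)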
Okay, so you might want to review working with BeautifulSoup. I referenced an old project of mine, and this is all you need to print them. Check the BeautifulSoup docs to find the exact syntax you want with the select method.
This will print all the img tags from the HTML:
import requests, bs4
site = 'http://www.goal.com/en-ie/news/ozil-agent-eviscerates-jealous-keown-over-stupid-comments/1javhtwzz72q113dnonn24mnr1'
p = requests.get(site).text
soupy = bs4.BeautifulSoup(p,"html.parser")
elems = soupy.select('img[src]')
for u in elems:
    print(u)
How can I use a regex within Scrapy? I've searched a lot but could not find any good instructions. I've tried the following, but it throws an exception, which I paste below.
import requests, re
from scrapy import Selector

LINK = 'http://www.viperinnovations.com/products-and-services/cableguardian'

def get_item(url):
    res = requests.get(url)
    sel = Selector(res)
    email = re.findall(r'[a-zA-Z0-9_.+-]+#[a-zA-Z0-9-]+\.[a-zA-Z0-9-.]+', sel)[0]
    print(email)

if __name__ == '__main__':
    get_item(LINK)
The exception it throws upon execution:
Traceback (most recent call last):
File "C:\Users\WCS\AppData\Local\Programs\Python\Python36-32\demo.py", line 13, in <module>
get_item(LINK)
File "C:\Users\WCS\AppData\Local\Programs\Python\Python36-32\demo.py", line 9, in get_item
email = re.findall(r'[a-zA-Z0-9_.+-]+#[a-zA-Z0-9-]+\.[a-zA-Z0-9-.]+',sel)[0]
File "C:\Users\WCS\AppData\Local\Programs\Python\Python36-32\lib\re.py", line 222, in findall
return _compile(pattern, flags).findall(string)
TypeError: expected string or bytes-like object
The email regex in my scraper above is just a placeholder. All I want to know is how I can use a regex within Scrapy. Thanks for any help.
A Selector isn't a string, it's an object that knows how to run queries on an HTML string or response object to find sub-elements.
Once you've found the element or elements you want (queries return a list, since they may match more than one element), the extract method will let you get the text of the found element or elements.
For example:
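(Assuming, for these examples, a hypothetical HTML string such as:)
>>> body = '<html><body><span>First span</span><span>Second span</span><span>Third span</span></body></html>'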
>>> Selector(text=body)
<Selector (text)>
>>> Selector(text=body).xpath('//span/text()')
<Selector (text) xpath=//span/text()>
>>> Selector(text=body).xpath('//span/text()').extract()
['First span', 'Second span', 'Third span']
It's only the last one you can do anything useful to with a regex:
>>> [match
... for text in Selector(text=body).xpath('//span/text()').extract()
... for match in re.findall(r'[a-z]*\s', text)]
['irst ', 'econd ', 'hird ']
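Applied to the question's code, a minimal sketch might look like this (building the Selector from text rather than from a requests Response, and using @ in the email pattern where the posted placeholder had #):

import re
import requests
from scrapy import Selector

LINK = 'http://www.viperinnovations.com/products-and-services/cableguardian'

def get_item(url):
    res = requests.get(url)
    sel = Selector(text=res.text)  # a Selector is built from text, not from a requests Response
    texts = sel.xpath('//body//text()').extract()  # extract() returns plain strings
    # run the regex on the extracted strings, not on the Selector itself
    emails = [m for t in texts
              for m in re.findall(r'[a-zA-Z0-9_.+-]+@[a-zA-Z0-9-]+\.[a-zA-Z0-9-.]+', t)]
    print(emails)

if __name__ == '__main__':
    get_item(LINK)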
I wrote a little Python function that uses the BeautifulSoup library. The function should return a list of symbols from a Wikipedia article.
In the Shell, I execute the function like this:
pythonScript.my_function()
...it throws an error at line 28:
No connection adapters were found for 'link'.
When I type the same code from my function directly into the shell, it works perfectly, with the same link. I have even copied and pasted the lines.
These are the two lines of code I'm talking about; the error appears with the BeautifulSoup call.
response = requests.get('link')
soup = bs4.BeautifulSoup(response.text)
I can't explain why this error happens...
EDIT: here is the full code
#!/usr/bin/python
# -*- coding: utf-8 -*-
# insert_symbols.py

from __future__ import print_function

import datetime
from math import ceil

import bs4
import MySQLdb as mdb
import requests

def obtain_parse_wiki_snp500():
    '''
    Download and parse the Wikipedia list of S&P500
    constituents using requests and BeautifulSoup.
    Returns a list of tuples to add to MySQL.
    '''
    # Stores the current time, for the created_at record
    now = datetime.datetime.utcnow()

    # Use requests and BeautifulSoup to download the
    # list of S&P500 companies and obtain the symbol table
    response = requests.get('http://en.wikipedia.org/wiki/list_of_S%26P_500_companies')
    soup = bs4.BeautifulSoup(response.text)
I think that must be enough; this is the point where the error occurs.
In the shell I've done everything step by step:
importing the libraries and then calling the requests and bs4 functions.
The only difference is that in the shell I didn't define a function for it.
EDIT 2:
Here is the exact error message:
Traceback (most recent call last):
File "", line 1, in
File "/home/felix/Dokumente/PythonSAT/DownloadSymbolsFromWikipedia.py", line 28, in obtain_parse_wiki_snp500
soup = bs4.BeautifulSoup(response.text)
File "/home/felix/anaconda3/lib/python3.5/site-packages/requests/api.py", line 67, in get
return request('get', url, params=params, **kwargs)
File "/home/felix/anaconda3/lib/python3.5/site-packages/requests/api.py", line 53, in request
return session.request(method=method, url=url, **kwargs)
File "/home/felix/anaconda3/lib/python3.5/site-packages/requests/sessions.py", line 468, in request
resp = self.send(prep, **send_kwargs)
File "/home/felix/anaconda3/lib/python3.5/site-packages/requests/sessions.py", line 570, in send
adapter = self.get_adapter(url=request.url)
File "/home/felix/anaconda3/lib/python3.5/site-packages/requests/sessions.py", line 644, in get_adapter
raise InvalidSchema("No connection adapters were found for '%s'" % url)
requests.exceptions.InvalidSchema: No connection adapters were found for 'htttp://en.wikipedia.org/wiki/list_of_S%26P_500_companies'
You have passed a link with htttp, i.e. you have three t's where there should be just two; that code would error wherever you ran it. The correct URL is:
http://en.wikipedia.org/wiki/list_of_S%26P_500_companies
Once you fix the url the request works fine:
In [1]: import requests
In [2]: requests.get('http://en.wikipedia.org/wiki/list_of_S%26P_500_companies')
Out[2]: <Response [200]>
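Then build the soup from that response (html.parser is an assumption here; any installed parser works):

In [3]: import bs4
In [4]: soup = bs4.BeautifulSoup(requests.get('http://en.wikipedia.org/wiki/list_of_S%26P_500_companies').text, 'html.parser')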
To get the symbols:
In [28]: for tr in soup.select("table.wikitable.sortable tr + tr"):
   ....:     sym = tr.select_one("a.external.text")
   ....:     if sym:
   ....:         print(sym.text)
   ....:
MMM
ABT
ABBV
ACN
ATVI
AYI
ADBE
AAP
AES
AET
AFL
AMG
A
GAS
APD
AKAM
ALK
AA
AGN
ALXN
ALLE
ADS
ALL
GOOGL
GOOG
MO
AMZN
AEE
AAL
AEP
...and so on.
I'm new to Python and Stack Overflow.
I'm trying to follow a YouTube tutorial (outdated, I'm guessing, based on the error I get) on fetching stock prices.
Here is the program:
import urllib.request
import re
html = urllib.request.urlopen('http://finance.yahoo.com/q?uhb=uh3_finance_vert_gs_ctrl2&fr=&type=2button&s=AAPL')
htmltext = html.read()
regex = '<span id="yfs_l84_aapl">.+?</span>'
pattern = re.compile(regex)
price = re.findall(pattern, htmltext)
print(price)
Since this is Python 3, I had to research urllib.request and use its methods instead of a simple urllib.urlopen.
Anyway, when I run it, I get the following error:
Traceback (most recent call last):
File "/Users/Harshil/Desktop/stockFetch.py", line 13, in <module>
price = re.findall(pattern, htmltext)
File "/Library/Frameworks/Python.framework/Versions/3.4/lib/python3.4/re.py", line 206, in findall
return _compile(pattern, flags).findall(string)
TypeError: can't use a string pattern on a bytes-like object
I recognized the error and attempted to fix it by adding the following:
codec = html.info().get_param('charset', 'utf8')
htmltext = html.decode(codec)
But it gives me another error:
Traceback (most recent call last):
File "/Users/Harshil/Desktop/stockFetch.py", line 9, in <module>
htmltext = html.decode(codec)
AttributeError: 'HTTPResponse' object has no attribute 'decode'
Hence, after spending a reasonable amount of time on it, I don't know what to do. All I want is to get the price for AAPL so I can go on to build a general program that fetches prices for an array of stocks and uses them in later programs.
Any help is appreciated. Thanks!
You are barking up the right tree. Try decoding the actual HTML byte string rather than the urlopen HTTPResponse:
htmltext = html.read()
codec = html.info().get_param('charset', 'utf8')
htmltext = htmltext.decode(codec)
price = re.findall(pattern, htmltext)
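Put together, a minimal sketch of the corrected script (keeping the question's URL and span id, which may no longer match what Yahoo Finance serves today):

import re
import urllib.request

html = urllib.request.urlopen('http://finance.yahoo.com/q?uhb=uh3_finance_vert_gs_ctrl2&fr=&type=2button&s=AAPL')
htmltext = html.read()                            # bytes
codec = html.info().get_param('charset', 'utf8')  # charset from the response headers, utf8 as fallback
htmltext = htmltext.decode(codec)                 # now a str, so a string pattern is fine

regex = '<span id="yfs_l84_aapl">.+?</span>'
pattern = re.compile(regex)
price = re.findall(pattern, htmltext)
print(price)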