I am trying to scrape stock prices from Yahoo! Finance into a local database, following a tutorial by Chris Reeves, and I keep getting the error shown below when I try to execute this code. Can anyone tell me what is wrong here? Thanks.
from threading import Thread
import urllib
import re
import MySQLdb

gmap = {}

def th(ur):
    base = "http://finance.yahoo.com/q?s=" + ur
    regex = '<span id="yfs_l84_' + ur.lower() + '">(.+?)</span>'
    pattern = re.compile(regex)
    htmltext = urllib.urlopen(base).read()
    results = re.findall(pattern, htmltext)
    try:
        gmap[ur] = results[0]
    except:
        print "Got an error"

symbolslist = open("multithread/stocks.txt").read()
symbolslist = symbolslist.replace(" ", "").split(",")
print symbolslist

threadlist = []

for u in symbolslist:
    t = Thread(target=th, args=(u,))
    t.start()
    threadlist.append(t)

for b in threadlist:
    b.join()
This is the exact error that I'm getting:
Exception in thread Thread-1:
Traceback (most recent call last):
File "C:\Python27\lib\threading.py", line 810, in __bootstrap_inner
self.run()
File "C:\Python27\lib\threading.py", line 763, in run
self.__target(*self.__args, **self.__kwargs)
File "multithread/threads.py", line 11, in th
pattern = re.compile(regex)
File "C:\Python27\lib\re.py", line 190, in compile
return _compile(pattern, flags)
File "C:\Python27\lib\re.py", line 242, in _compile
raise error, v # invalid expression
error: unexpected end of regular expression
Alas, you didn't show us the important part: the output of print symbolslist. Something in that list is creating an invalid regular expression when you paste it into the <span ... boilerplate.
You can probably fix it by changing that line like so:
regex = '<span id="yfs_l84_' + re.escape(ur.lower()) + '">(.+?)</span>'
(the change is wrapping ur.lower() in a re.escape(...) call)
However, if that works, it would probably only be hiding the real problem. The real problem is probably that you have some kind of nonsense in symbolslist.
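For illustration, here is a minimal sketch (the symbol list contents below are made up) of how re.escape keeps a stray character in a ticker symbol from breaking the pattern:

import re

symbols = ["GOOG", "BRK.B", "BAD("]  # hypothetical contents of stocks.txt
for s in symbols:
    regex = '<span id="yfs_l84_' + re.escape(s.lower()) + '">(.+?)</span>'
    re.compile(regex)  # compiles cleanly even for "BAD(", because re.escape neutralizes the "("
    print(regex)

Without re.escape, the "BAD(" entry produces an unbalanced parenthesis in the pattern and re.compile raises an error much like the one in your traceback.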
Related
I'm working on a web scraper project with HTMLSession; I plan to scrape Ask search engine results using a set of user-defined keywords. I have already started writing the code for my scraper. Here it is:
from requests_html import HTMLSession

class Scraper():
    def scrapedata(self, tag):
        url = f'https://www.ask.com/web?q={tag}'
        s = HTMLSession()
        r = s.get(url)
        print(r.status_code)
        qlist = []
        ask = r.html.find('div.PartialSearchResults-item')
        for a in ask:
            print(a.find('a.PartialSearchResults-item-title-link.result-link::text', first=True).text.strip())

ask = Scraper()
ask.scrapedata('ferrari')
However, when I run this code, instead of getting the list of all the web page titles related to the searched keyword in my terminal as it should, I get the following errors:
[Running] python -u "c:\Users\user\Documents\AAprojects\Whelpsgroups1\Beauty\scraper.py"
200
Traceback (most recent call last):
File "c:\Users\user\Documents\AAprojects\Whelpsgroups1\Beauty\scraper.py", line 19, in <module>
ask.scrapedata('ferrari')
File "c:\Users\user\Documents\AAprojects\Whelpsgroups1\Beauty\scraper.py", line 15, in scrapedata
print(a.find('a.PartialSearchResults-item-title-link.result-link::text', first = True ).text.strip())
File "C:\Python310\lib\site-packages\requests_html.py", line 212, in find
for found in self.pq(selector)
File "C:\Python310\lib\site-packages\pyquery\pyquery.py", line 261, in __call__
result = self._copy(*args, parent=self, **kwargs)
File "C:\Python310\lib\site-packages\pyquery\pyquery.py", line 247, in _copy
return self.__class__(*args, **kwargs)
File "C:\Python310\lib\site-packages\pyquery\pyquery.py", line 232, in __init__
xpath = self._css_to_xpath(selector)
File "C:\Python310\lib\site-packages\pyquery\pyquery.py", line 243, in _css_to_xpath
return self._translator.css_to_xpath(selector, prefix)
File "C:\Python310\lib\site-packages\cssselect\xpath.py", line 190, in css_to_xpath
return ' | '.join(self.selector_to_xpath(selector, prefix,
File "C:\Python310\lib\site-packages\cssselect\xpath.py", line 190, in <genexpr>
return ' | '.join(self.selector_to_xpath(selector, prefix,
File "C:\Python310\lib\site-packages\cssselect\xpath.py", line 222, in selector_to_xpath
xpath = self.xpath_pseudo_element(xpath, selector.pseudo_element)
File "C:\Python310\lib\site-packages\cssselect\xpath.py", line 232, in xpath_pseudo_element
raise ExpressionError('Pseudo-elements are not supported.')
cssselect.xpath.ExpressionError: Pseudo-elements are not supported.
[Done] exited with code=1 in 17.566 seconds
I don't even know what this means. I searched the Internet but only came across problems related to IE7, and I don't see what that has to do with my issue, especially since I'm using Microsoft Edge as my default web browser. I hope to count on the help of more experienced members of the community to solve this. Thank you from Cameroon.
Just remove the ::text pseudo-element from the selector, like this: print(a.find("a.PartialSearchResults-item-title-link.result-link", first=True).text.strip()) and you will get the titles of your web pages.
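For illustration, a minimal sketch of the corrected loop (the selector names are taken from the question; the page structure is assumed to be unchanged):

from requests_html import HTMLSession

s = HTMLSession()
r = s.get('https://www.ask.com/web?q=ferrari')
for a in r.html.find('div.PartialSearchResults-item'):
    # cssselect does not support pseudo-elements such as ::text, so select the
    # element itself and read its .text attribute instead.
    link = a.find('a.PartialSearchResults-item-title-link.result-link', first=True)
    if link is not None:
        print(link.text.strip())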
I am having a problem with the bit of code shown below. My original code worked when I was just pulling the tweet information. Once I edited it to extract the URL within the text, it started to give me problems. Nothing is printing and I am receiving these errors.
Traceback (most recent call last):
File "C:\Users\Evan\PycharmProjects\DiscordBot1\main.py", line 22, in <module>
get_tweets(api, "cnn")
File "C:\Users\Evan\PycharmProjects\DiscordBot1\main.py", line 18, in get_tweets
url2 = re.findall('http[s]?://(?:[a-zA-Z]|[0-9]|[$-_#.&+]|[!*\(\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+',
text)
File "C:\Users\Evan\AppData\Local\Programs\Python\Python39\lib\re.py", line 241, in findall
return _compile(pattern, flags).findall(string)
TypeError: cannot use a string pattern on a bytes-like object
I am receiving no errors before I run it, so I am extremely confused about why this is not working. It will probably be something simple as I am new to using both Tweepy and Regex.
import tweepy
import re

TWITTER_APP_SECRET = 'hidden'
TWITTER_APP_KEY = 'hidden'

auth = tweepy.OAuthHandler(TWITTER_APP_KEY, TWITTER_APP_SECRET)
api = tweepy.API(auth)

def get_tweets(api, username):
    page = 1
    while True:
        tweets = api.user_timeline(username, page=page)
        for tweet in tweets:
            text = tweet.text.encode("utf-8")
            url2 = re.findall('http[s]?://(?:[a-zA-Z]|[0-9]|[$-_#.&+]|[!*\(\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+', text)
            print(url2)

get_tweets(api, "cnn")
Tell me if you need more information to help me, any help is appreciated, thanks in advance.
You're getting that error because you are using a string pattern (your regex) against text that you've turned into a bytes object via encode().
Try running your pattern directly against tweet.text without encoding it.
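For illustration, a minimal sketch of that change (the pattern is the one from the question, pre-compiled once; tweet text is assumed to already be a str under Python 3):

import re

URL_RE = re.compile(r'http[s]?://(?:[a-zA-Z]|[0-9]|[$-_#.&+]|[!*\(\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+')

def extract_urls(tweet_text):
    # tweet_text stays a str (no .encode()), so it matches the str pattern above
    return URL_RE.findall(tweet_text)

print(extract_urls("Breaking news: https://cnn.it/example full story inside"))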
I have a function that tries a list of regexes on some text to see if there's a match.
@timeout(1)
def get_description(data, old):
    description = None
    if old:
        for rx in rxs:
            try:
                matched = re.search(rx, data, re.S | re.M)
                if matched is not None:
                    try:
                        description = matched.groups(1)
                        if description:
                            return description
                        else:
                            continue
                    except TimeoutError as why:
                        print(why)
                        continue
                else:
                    continue
            except Exception as why:
                print(why)
                pass
I use this function in a loop and run a bunch of text files through. In one file, execution keeps stopping:
Traceback (most recent call last):
File "extract.py", line 223, in <module>
scrape()
File "extract.py", line 40, in scrape
metadata = get_metadata(f)
File "extract.py", line 186, in get_metadata
description = get_description(text, True)
File "extract.py", line 64, in get_description
matched = re.search(rx, data, re.S|re.M)
File "C:\Users\Joseph\AppData\Local\Programs\Python\Python36\lib\re.py", line 182, in search
return _compile(pattern, flags).search(string)
KeyboardInterrupt
It simply hangs on evaluating matched = re.search(rx, data, re.S|re.M). For many other files, when no match is found, it goes on to the next regex. With this file, it does nothing and throws no exception. Any ideas what could be causing this?
EDIT:
I'm now trying to detect timeout errors (this is more efficient for me than changing the rx's).
The TimeoutError, borrowed from this question, is triggered but doesn't cause the script to keep running. It simply writes 'Timer expired' and stays frozen.
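For illustration only (this is not the decorator from the linked question), one way to keep the caller from blocking on a pathological pattern is to run the search in a worker thread and give up waiting after a deadline. Note that the worker thread cannot be killed and keeps searching in the background; this only unblocks the caller so it can move on to the next regex or file:

import re
from concurrent.futures import ThreadPoolExecutor, TimeoutError as FutureTimeout

def search_with_timeout(rx, data, seconds=1.0):
    # Run re.search in a worker thread and stop waiting after `seconds`.
    pool = ThreadPoolExecutor(max_workers=1)
    future = pool.submit(re.search, rx, data, re.S | re.M)
    try:
        return future.result(timeout=seconds)
    except FutureTimeout:
        return None  # treat a slow (likely backtracking) pattern as "no match"
    finally:
        pool.shutdown(wait=False)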
When my Python code tries to use simplify, it shows the following error. The problem appeared after I ran a separate pyparsing code file (which executed successfully). The same code was working fine before.
Edit:
>>> expression="a+b+z"
>>> t=simplify(expression)
ast.py:4: SyntaxWarning: invalid pattern (**) passed to Regex
operator = pp.Regex("**").setName("operator")
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "C:\Python27\lib\site-packages\sympy\simplify\simplify.py", line 507, in simplify
expr = sympify(expr)
File "C:\Python27\lib\site-packages\sympy\core\sympify.py", line 308, in sympify
from sympy.parsing.sympy_parser import (parse_expr, TokenError,
File "C:\Python27\lib\site-packages\sympy\parsing\sympy_parser.py", line 11, in <module>
import ast
File "ast.py", line 4, in <module>
operator = pp.Regex("**").setName("operator")
File "C:\Python27\lib\site-packages\pyparsing.py", line 1920, in __init__
self.re = re.compile(self.pattern, self.flags)
File "C:\Python27\Lib\re.py", line 190, in compile
return _compile(pattern, flags)
File "C:\Python27\Lib\re.py", line 244, in _compile
raise error, v # invalid expression
sre_constants.error: nothing to repeat
Any suggestions?
You have a local file, ast.py, which is getting imported in place of Python's built-in ast module. You should remove or rename this file to avoid the name conflict, as this can cause other modules to not work correctly.
Additionally, your local module contains the following line, which is causing an exception on import:
operator = pp.Regex("**").setName("operator")
** is not a valid regular expression. In a regular expression, * means "0 or more repetitions of the preceding expression", which doesn't make sense at the beginning of an expression because there is "nothing to repeat" (as the error message says).
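For illustration, assuming the intent was to match the literal ** operator, either escaping the pattern or using a plain string token avoids the error:

import re
import pyparsing as pp

# re.compile("**") fails with "nothing to repeat"; escaping makes it a literal match.
operator = pp.Regex(re.escape("**")).setName("operator")
# Or, more simply, treat it as a fixed token rather than a regular expression:
operator = pp.Literal("**").setName("operator")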
I am trying this code from the docs here:
class Form(Form):
    image = FileField(u'Image File', validators=[Regexp(u'^[^/\\]\.jpg$')])

    def validate_image(form, field):
        if field.data:
            field.data = re.sub(r'[^a-z0-9_.-]', '_', field.data)
Here is the error:
Traceback (most recent call last):
File "tornadoexample2-1.py", line 111, in <module>
class Form(Form):
File "tornadoexample2-1.py", line 119, in Form
image = FileField(u'Image File', validators=[Regexp(u'^[^/\\]\.jpg$')])
File "/usr/local/lib/python2.7/dist-packages/wtforms/validators.py", line 256, in __init__
regex = re.compile(regex, flags)
File "/usr/lib/python2.7/re.py", line 190, in compile
return _compile(pattern, flags)
File "/usr/lib/python2.7/re.py", line 242, in _compile
raise error, v # invalid expression
sre_constants.error: unexpected end of regular expression
Any idea about what the problem is?
The regexp in Regexp(u'^[^/\\]\.jpg$') is not quite right.
Try running this, you will get the same exception:
import re
re.compile(u'^[^/\\]\.jpg$')
You need to escape each backslash twice inside the [] brackets: in a plain (non-raw) string, '\\' is just one backslash, and that backslash then escapes the closing ], so the character class never ends and the pattern looks truncated to the regex engine.
So you can rewrite it as u'^[^/\\\\]\.jpg$' or as a raw string ur'^[^/\\]\.jpg$'.
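For illustration, both suggested forms compile cleanly (a quick sketch for the question's Python 2.7 setup):

import re

re.compile(u'^[^/\\\\]\.jpg$')   # doubled backslash in a plain unicode string
re.compile(ur'^[^/\\]\.jpg$')    # raw string: the same pattern, easier to read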
Hope this helps.