I'm having trouble with the following code. It's supposed to print stock prices by accessing Yahoo Finance, but I can't figure out why it's returning empty strings.
import urllib
import re

symbolslist = ["aapl", "spy", "goog", "nflx"]
i = 0
while i < len(symbolslist):
    url = "http://finance.yahoo.com/q?s=" + symbolslist[i] + "&q1=1"
    htmlfile = urllib.urlopen(url)
    htmltext = htmlfile.read()
    regex = '<span id="yfs_l84_' + symbolslist[i] + '">(.+?)</span>'
    pattern = re.compile(regex)
    price = re.findall(pattern, htmltext)
    print price
    i += 1
Edit: It works fine now, it was a syntax error. Edited the code above as well.
These are just a few helpful tips for Python development (and scraping):
Python Requests library
The Python requests library is excellent at simplifying the process of making HTTP requests.
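For example (a minimal sketch using the URL format from the question; the page layout may well have changed since this was written):
import requests

r = requests.get("http://finance.yahoo.com/q?s=aapl&q1=1")
html = r.text  # the decoded response body as a string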
No need to use a while loop
A for loop is a more natural fit here:
symbolslist = ["aapl", "spy", "goog", "nflx"]

for symbol in symbolslist:
    # Do logic here...
Use xpath over regular expressions
import requests
import lxml.html

url = "http://www.google.co.uk/finance?q=" + symbol + "&q1=1"
r = requests.get(url)
xpath = '//your/xpath'
root = lxml.html.fromstring(r.content)
results = root.xpath(xpath)
No need to compile your regular expressions each time
Compiling a regex takes time, so hoist the compile out of your loop. Since the only part of the pattern that changes is the ticker symbol, one pattern can cover them all:
pattern = re.compile(r'<span id="yfs_l84_[a-z]+">(.+?)</span>')

for symbol in symbolslist:
    # do logic
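Putting these tips together, a rough sketch of the whole scraper might look like this (assuming the yfs_l84_... span still appears on Yahoo's quote page, which may well have changed since the question was asked):
import re
import requests

symbolslist = ["aapl", "spy", "goog", "nflx"]
pattern = re.compile(r'<span id="yfs_l84_[a-z]+">(.+?)</span>')  # compiled once, outside the loop

for symbol in symbolslist:
    r = requests.get("http://finance.yahoo.com/q?s=" + symbol + "&q1=1")
    prices = pattern.findall(r.text)  # captured price strings, empty if the markup changed
    print(symbol, prices)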
External Libraries
As mentioned in the comment by drewk, both Pandas and Matplotlib have native functions to get Yahoo quotes, or you can use the ystockquote library to scrape from Yahoo. It is used like so:
#!/usr/bin/env python
import ystockquote

symbolslist = ["aapl", "spy", "goog", "nflx"]

for symbol in symbolslist:
    print(ystockquote.get_price(symbol))
Related
I have written a web scraping program in Python. It works correctly but takes 1.5 hours to execute, and I am not sure how to optimize it.
The logic of the code is that every country has many ASNs with the client name. I am getting all the ASN links (e.g. https://ipinfo.io/AS2856) and using Beautiful Soup and regex to get the data as JSON.
The output is just a simple JSON.
import urllib.request
import bs4
import re
import json

url = 'https://ipinfo.io/countries'
SITE = 'https://ipinfo.io'

def url_to_soup(url):
    # bgp.he.net is filtered by user-agent
    req = urllib.request.Request(url)
    opener = urllib.request.build_opener()
    html = opener.open(req)
    soup = bs4.BeautifulSoup(html, "html.parser")
    return soup

def find_pages(page):
    pages = []
    for link in page.find_all(href=re.compile('/countries/')):
        pages.append(link.get('href'))
    return pages

def get_each_sites(links):
    mappings = {}
    print("Scraping Pages for ASN Data...")
    for link in links:
        country_page = url_to_soup(SITE + link)
        current_country = link.split('/')[2]
        for row in country_page.find_all('tr'):
            columns = row.find_all('td')
            if len(columns) > 0:
                #print(columns)
                current_asn = re.findall(r'\d+', columns[0].string)[0]
                print(SITE + '/AS' + current_asn)
                s = str(url_to_soup(SITE + '/AS' + current_asn))
                asn_code, name = re.search(r'(?P<ASN_CODE>AS\d+) (?P<NAME>[\w.\s(&)]+)', s).groups()
                #print(asn_code[2:])
                #print(name)
                country = re.search(r'.*href="/countries.*">(?P<COUNTRY>.*)?</a>', s).group("COUNTRY")
                print(country)
                registry = re.search(r'Registry.*?pb-md-1">(?P<REGISTRY>.*?)</p>', s, re.S).group("REGISTRY").strip()
                #print(registry)
                # the re.S flag makes '.' match any character at all, including a newline
                mtch = re.search(r'IP Addresses.*?pb-md-1">(?P<IP>.*?)</p>', s, re.S)
                if mtch:
                    ip = mtch.group("IP").strip()
                    #print(ip)
                mappings[asn_code[2:]] = {'Country': country,
                                          'Name': name,
                                          'Registry': registry,
                                          'num_ip_addresses': ip}
    return mappings

main_page = url_to_soup(url)
country_links = find_pages(main_page)
#print(country_links)
asn_mappings = get_each_sites(country_links)
print(asn_mappings)
The output is as expected, but super slow.
You probably don't want to speed your scraper up. When you scrape a site, or connect in a way that humans don't (24/7), it's good practice to keep requests to a minimum so that:
You blend into the background noise
You don't (D)DoS the website in the hope of finishing faster, while racking up costs for the website owner
What you can do, however, is get the AS names and numbers from this website (see this SO answer), and recover the IPs using PyASN.
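For example, a rough sketch with PyASN (this assumes you have already built a local IPASN data file with the helper scripts that ship with the library; the filename below is just a placeholder):
import pyasn

# 'ipasn_db.dat' is a placeholder for a locally built IPASN data file
asndb = pyasn.pyasn('ipasn_db.dat')

print(asndb.lookup('8.8.8.8'))       # (origin ASN, announcing prefix) for an IP
print(asndb.get_as_prefixes(2856))   # prefixes announced by AS2856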
I think what you need is to run the scraping in multiple processes. This can be done using the Python multiprocessing package, since multi-threaded programs don't get true parallelism in Python because of the GIL (Global Interpreter Lock). There are plenty of examples of how to do this; here are some, and a rough sketch follows the links below:
Multiprocessing Spider
Speed up Beautiful soup scraper
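As a rough sketch (reusing the question's url_to_soup helper and SITE constant; the list of ASN URLs below is just a placeholder for whatever you collect from the country pages), a multiprocessing Pool lets several pages be fetched in parallel:
from multiprocessing import Pool

def fetch_asn_page(asn_url):
    # Fetch and parse one ASN page; url_to_soup comes from the question's code
    return str(url_to_soup(asn_url))

if __name__ == '__main__':
    asn_urls = [SITE + '/AS2856', SITE + '/AS3320']  # placeholder list of ASN links
    with Pool(processes=8) as pool:
        pages = pool.map(fetch_asn_page, asn_urls)  # fetches run in 8 worker processes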
I started learning Python earlier today and as my first project I wanted to make a script that shows me today's weather forecast.
My script:
import urllib2, re
url = urllib2.urlopen('http://www.wetter.com/wetter_aktuell/wettervorhersage/heute/deutschland/oberhausen/DE0007740.html')
html = url.read()
url.close()
x = re.search("""<dl><dd><strong>(?P<uhrzeit>.*)""", html, re.S)
x = re.search("""<dd><span class="degreespan" style="font-weight:normal;">(?P<temp>.*)""", html, re.S)
print x.group('uhrzeit'), x.group('temp')
I used this as a template. When I run this script I get an IndexError: no such group.
You are overwriting x.
Maybe you want:
x = re.search("""<dl><dd><strong>(?P<uhrzeit>.*)""", html, re.S)
y = re.search("""<dd><span class="degreespan" style="font-weight:normal;">(?P<temp>.*)""", html, re.S)
print x.group('uhrzeit'), y.group('temp')
And I can't believe that the site you linked to advocates using regular expressions for extracting information from HTML.
I am new to Python. A coder helped me out by giving me some code to parse HTML, and I'm having trouble understanding how it works. My idea is for it to grab (consume?) HTML from
funtweets.com/random and basically tell me a funny joke in the morning as an alarm clock. It currently extracts all jokes on the page and I only want one. Either a modification to the code or a detailed explanation of how it works would be helpful to me. This is the code:
import re
import urllib2

page = urllib2.urlopen("http://www.m.funtweets.com/random").read()
user = re.compile(r'<span>#</span>(\w+)')
text = re.compile(r"</b></a> (\w.*)")
user_lst = [match.group(1) for match in re.finditer(user, page)]
text_lst = [match.group(1) for match in re.finditer(text, page)]

for _user, _text in zip(user_lst, text_lst):
    print '#{0}\n{1}\n'.format(_user, _text)
user3530608: do you want just one match, instead of iterating through all matches?
This is a nice way to get started with python regular expressions.
Here is a small tweak to your code. I don't have python in front of me to test it, so let me know if you run into any issues.
import re
import urllib2
page = urllib2.urlopen("http://www.m.funtweets.com/random").read()
umatch = re.search(r"<span>#</span>(\w+)", page)
user = umatch.group()
utext = re.search(r"</b></a> (\w.*)", page)
text = utext.group()
print '#{0}\n{1}\n'.format(user,text)
Although you can parse HTML with regex, I strongly suggest you use a third-party Python library.
My favorite HTML-parsing library is PyQuery, which you can use just like jQuery:
such as
from pyquery import PyQuery as pq
page=pq(url='http://www.m.funtweets.com/random')
users=page("#user_id")
a_first=page("a:first")
...
You can find it here: https://pypi.python.org/pypi/pyquery
Just:
pip install PyQuery
or
easy_install PyQuery
You'll love it!
Another HTML-parsing library: https://pypi.python.org/pypi/beautifulsoup4/4.3.2
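Going back to PyQuery, a minimal sketch of pulling text out of the page (the selector here is just a placeholder; inspect the page to find the element that actually holds the tweet):
from pyquery import PyQuery as pq

page = pq(url='http://www.m.funtweets.com/random')
first_link = page('a').eq(0)   # first <a> element on the page (placeholder selector)
print(first_link.text())       # its text content, with the HTML tags stripped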
If anyone is interested in getting only one joke from the HTML, with no HTML tags, here is the final code:
import re
import urllib2

def remove_html_tags(text):
    pattern = re.compile(r'</b></a>')
    return pattern.sub('', text)

page = urllib2.urlopen("http://www.m.funtweets.com/random").read()
umatch = re.search(r"<span>#</span>(\w+)", page)
user = umatch.group()
utext = re.search(r"</b></a> (\w.*)", page)
text = utext.group()

print remove_html_tags(text)
For a class, I have an exercise where I need to count the number of images on any given web page. I know that every image starts with <img, so I am using a regexp to try to locate them. But I keep getting a count of one, which I know is wrong. What is wrong with my code?
import sys
import urllib
import urllib.request
import re

img_pat = re.compile('<img.*>', re.I)

def get_img_cnt(url):
    try:
        w = urllib.request.urlopen(url)
    except IOError:
        sys.stderr.write("Couldn't connect to %s " % url)
        sys.exit(1)
    contents = str(w.read())
    img_num = len(img_pat.findall(contents))
    return img_num

print(get_img_cnt('http://www.americascup.com/en/schedules/races'))
Don't ever use regex for parsing HTML; use an HTML parser, like lxml or BeautifulSoup. Here's a working example of how to get the img tag count using BeautifulSoup and requests:
from bs4 import BeautifulSoup
import requests

def get_img_cnt(url):
    response = requests.get(url)
    soup = BeautifulSoup(response.content)
    return len(soup.find_all('img'))

print(get_img_cnt('http://www.americascup.com/en/schedules/races'))
Here's a working example using lxml and requests:
from lxml import etree
import requests

def get_img_cnt(url):
    response = requests.get(url)
    parser = etree.HTMLParser()
    root = etree.fromstring(response.content, parser=parser)
    return int(root.xpath('count(//img)'))

print(get_img_cnt('http://www.americascup.com/en/schedules/races'))
Both snippets print 106.
Also see:
Python Regex - Parsing HTML
Python regular expression for HTML parsing (BeautifulSoup)
Hope that helps.
Ahhh regular expressions.
Your regex pattern <img.*> says "Find me something that starts with <img and stuff, and make sure it ends with >."
Regular expressions are greedy, though; it'll fill that .* with literally everything it can while leaving a single > character somewhere afterwards to satisfy the pattern. In this case, it would go all the way to the end, to the closing </html>, and say "look! I found a > right there!"
You should come up with the right count by making .* non-greedy, like this:
<img.*?>
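A toy example shows the difference (this is made-up HTML, not the actual page):
import re

html = '<img src="a.png"> some text <img src="b.png"> more text'
print(re.findall('<img.*>', html))   # greedy: one match spanning both tags
print(re.findall('<img.*?>', html))  # non-greedy: two separate <img> tags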
Your regular expression is greedy, so it matches much more than you want. I suggest using an HTML parser.
img_pat = re.compile('<img.*?>',re.I) will do the trick if you must do it the regex way. The ? makes it non-greedy.
A good website for checking what your regex matches on the fly: http://www.pyregex.com/
Learn more about regexes: http://docs.python.org/2/library/re.html
I need to extract the meta keywords from a web page using Python. I was thinking that this could be done using urllib or urllib2, but I'm not sure. Anyone have any ideas?
I am using Python 2.6 on Windows XP
lxml is faster than BeautifulSoup (I think) and has much better functionality, while remaining relatively easy to use. Example:
>>> from urllib import urlopen
>>> from lxml import etree
>>> f = urlopen("http://www.google.com").read()
>>> tree = etree.HTML(f)
>>> m = tree.xpath("//meta")
>>> for i in m:
...     print etree.tostring(i)
...
<meta http-equiv="content-type" content="text/html; charset=ISO-8859-2"/>
Edit: another example.
>>> f = urlopen("http://www.w3schools.com/XPath/xpath_syntax.asp").read()
>>> tree = etree.HTML(f)
>>> tree.xpath("//meta[@name='Keywords']")[0].get("content")
"xml,tutorial,html,dhtml,css,xsl,xhtml,javascript,asp,ado,vbscript,dom,sql,colors,soap,php,authoring,programming,training,learning,beginner's guide,primer,lessons,school,howto,reference,examples,samples,source code,tags,demos,tips,links,FAQ,tag list,forms,frames,color table,w3c,cascading style sheets,active server pages,dynamic html,internet,database,development,Web building,Webmaster,html guide"
BTW: XPath is worth knowing.
Another edit:
Alternatively, you can just use a regexp:
>>> f = urlopen("http://www.w3schools.com/XPath/xpath_syntax.asp").read()
>>> import re
>>> re.search("<meta name=\"Keywords\".*?content=\"([^\"]*)\"", f).group(1)
"xml,tutorial,html,dhtml,css,xsl,xhtml,javascript,asp,ado,vbscript,dom,sql, ...etc...
...but I find it less readable and more error-prone (though it involves only standard modules and still fits on one line).
BeautifulSoup is a great way to parse HTML with Python.
Particularly, check out the findAll method:
http://www.crummy.com/software/BeautifulSoup/documentation.html
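For example, a minimal sketch (using the old BeautifulSoup 3 import to match the Python 2.6 setup from the question; the URL is a placeholder):
import urllib2
from BeautifulSoup import BeautifulSoup

html = urllib2.urlopen('http://www.example.com').read()
soup = BeautifulSoup(html)
tag = soup.find('meta', attrs={'name': 'keywords'})  # findAll would return every match
if tag:
    print tag['content']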
Why not use a regular expression?
keywordregex = re.compile('<meta\sname=["\']keywords["\']\scontent=["\'](.*?)["\']\s/>')
keywordlist = keywordregex.findall(html)
if len(keywordlist) > 0:
    keywordlist = keywordlist[0]
    keywordlist = keywordlist.split(", ")