I'm having trouble achieving the following in Python. I am making API requests in a for loop, and I would like to check the status code after each request and pause based on it, so that the loop runs without errors. For example, I am after something like:
import requests
from time import sleep

for i in X:
    url = 'abc' + i
    r = requests.get(url)
    while r.status_code == 123:  # keep retrying while we see this status code
        sleep(1)
        r = requests.get(url)
    <the code I want to run that uses r>
How can I achieve this? Thanks in advance :)
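One way to make this pattern a bit safer is to cap the number of retries, so a persistent error cannot loop forever. This is a minimal sketch, assuming a hypothetical list of URL suffixes and a 429 (rate-limit) status code:
import requests
from time import sleep

X = ['1', '2', '3']   # hypothetical URL suffixes
MAX_RETRIES = 5       # give up after this many attempts

for i in X:
    url = 'abc' + i
    r = requests.get(url)
    retries = 0
    # retry while rate-limited, but never more than MAX_RETRIES times
    while r.status_code == 429 and retries < MAX_RETRIES:
        sleep(1)
        r = requests.get(url)
        retries += 1
    if r.ok:
        data = r.json()  # use the successful response here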
I have very basic knowledge of Python, so sorry if my question sounds dumb.
I need to query a website for a personal project I am doing, but I need to query it 500 times, and each time I need to change one specific part of the URL, then take the data and upload it to gsheets.
(The () signifies the part of the URL I need to change.)
'https://www.alphavantage.co/query?function=BALANCE_SHEET&symbol=(symbol)&apikey=apikey'
I thought about using while and format {} to do it, but I was unsure how to change the string each time, short of writing out the variable names by hand (defeating the whole purpose of this).
I already have a list of the symbols I need to use, but I don't know how to feed them in.
Example of how I get 1 piece of data
import requests
url = 'https://www.alphavantage.co/query?function=BALANCE_SHEET&symbol=MMM&apikey=demo'
r = requests.get(url)
data = r.json()
Example of what I'd like to change it to
import requests
url = 'https://www.alphavantage.co/query?function=BALANCE_SHEET&symbol=AOS&apikey=demo'
r = requests.get(url)
data = r.json()
#then change it to
import requests
url = 'https://www.alphavantage.co/query?function=BALANCE_SHEET&symbol=ABT&apikey=demo'
r = requests.get(url)
data = r.json()
so on and so forth, 500 times.
You might combine .format with a for loop; consider the following simple example
symbols = ["abc", "xyz", "123"]
for s in symbols:
    url = 'https://www.example.com?symbol={}'.format(s)
    print(url)
output
https://www.example.com?symbol=abc
https://www.example.com?symbol=xyz
https://www.example.com?symbol=123
You might also elect to use any other way of formatting, e.g. an f-string (requires Python 3.6 or newer), in which case the code would be
symbols = ["abc", "xyz", "123"]
for s in symbols:
    url = f'https://www.example.com?symbol={s}'
    print(url)
Alternatively, you might use the optional params argument of the requests.get function, as follows
import requests

symbols = ["abc", "xyz", "123"]
for s in symbols:
    r = requests.get('https://www.example.com', params={'symbol': s})
    print(r.url)
output
https://www.example.com/?symbol=abc
https://www.example.com/?symbol=xyz
https://www.example.com/?symbol=123
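Applied to the original Alpha Vantage URL, a minimal sketch might look like this (the three-symbol list is a placeholder for your 500 symbols, and the demo API key comes from the question):
import requests

symbols = ["MMM", "AOS", "ABT"]  # placeholder: substitute your list of 500 symbols
results = {}
for s in symbols:
    r = requests.get(
        'https://www.alphavantage.co/query',
        params={'function': 'BALANCE_SHEET', 'symbol': s, 'apikey': 'demo'},
    )
    results[s] = r.json()  # keep each response for the later gsheets upload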
I'm trying to get dividend information from Morningstar.
The following code works for scraping info from finviz, but the dividend information does not match my broker platform.
import urllib3
from bs4 import BeautifulSoup

symbol = 'bxs'
morningstar_url = 'https://www.morningstar.com/stocks/xnys/' + symbol + '/dividends'
http = urllib3.PoolManager()
response = http.request('GET', morningstar_url)
soup = BeautifulSoup(response.data, 'lxml')
html = list(soup.children)[1]
[type(item) for item in list(soup.children)]

def display_elements(L, show=0):
    # list the direct children of a node, optionally printing each with its index
    test = list(L.children)
    if show:
        for i in range(len(test)):
            print(i)
            print(test[i])
            print()
    return test

test = display_elements(html, 1)
I have no issue printing out the elements but cannot find the element that houses the information such as "Total Yield %" of 2.8%. How do I get inside the mds-data-table to extract the information?
Great question! I've actually worked on this specifically, but years ago. Morningstar will only load the tables after running a script, to prevent this exact type of scraping behavior. If you view the page source immediately on load, you won't be able to see the table HTML at all.
What you're going to want to do is find the JavaScript code that is loading the elements, and hook bs4 up to use that. You'll have to poke around the files, but somewhere deep in those JS files you'll find a dynamic URL. It'll be hidden, but it'll be in there somewhere. I'll go look at some of my old code and see if I can find something that helps.
So here's an edited sample of what used to work for me:
import logging
import time
from urllib.request import urlopen

exchange = 'NYSE'
ticker = 'V'

if exchange == 'NYSE':
    exchange_code = "XNYS"
elif exchange in ["NasdaqNM", "NASDAQ"]:
    exchange_code = "XNAS"
else:
    logging.info("Unknown Exchange Code for {}".format(ticker))
    raise SystemExit

time_now = int(time.time())
time_delay = int(time.time() + 150)
morningstar_raw = urlopen(f'http://financials.morningstar.com/ajax/ReportProcess4HtmlAjax.html?&t={exchange_code}:{ticker}&region=usa&culture=en-US&cur=USD&reportType=is&period=12&dataType=A&order=asc&columnYear=5&rounding=3&view=raw&r=354589&callback=jsonp{time_now}&_={time_delay}')
print(morningstar_raw)
Granted, this solution is from a file last edited sometime in 2018, and they may have changed up their scripting, but you can find this and much more in my GitHub project wxStocks.
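Since the URL above requests a JSONP callback (callback=jsonp{time_now}), the response body is wrapped in a function call rather than being bare JSON. A minimal, hedged sketch of unwrapping such a response (assuming the body looks like jsonp1234({...});) might be:
import json
import re

def strip_jsonp(body):
    # extract the JSON payload from a JSONP response like jsonp1234({...});
    match = re.search(r'^[^(]+\((.*)\)\s*;?\s*$', body, re.S)
    if match is None:
        raise ValueError("response does not look like JSONP")
    return json.loads(match.group(1))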
I want to use a while loop to periodically refresh a method.
def usagePerUserApi():
    while True:
        url = ....
        resp = requests.get(url, headers=headers, verify=False)
        data = json.loads(resp.content)
        code = resp.status_code
        Verbindungscheck.ausgabeVerbindungsCode(code)

        head = .....
        table = []
        for item in data['data']:
            if item['un'] == tecNo:
                table.append([
                    item['fud'],
                    item['un'],
                    str(item['lsn']),
                    str(item['fns']),
                    str(item['musage']) + "%",
                    str(item['hu']),
                    str(item['mu']),
                    str(item['hb']),
                    str(item['mb'])
                ])
        print(tabulate(table, headers=head, tablefmt="github"))
        time.sleep(300)
If I leave time.sleep like this, it is flagged as an error. If I put it under the while loop, it updates constantly and does not wait 5 minutes.
I don't know where the mistake is. I hope you can help me.
You need to import the Python time library.
If you place
import time
at the top of your file, it should work.
Have you imported the time library? If not, then add
import time
to the top of your code, and it should work.
Also bear in mind that there may be problems with output buffering, where the program won't wait as expected, and so you'll need to turn it off, as shown by this answer.
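Putting both answers together, a minimal sketch of the fixed structure might look like this (the URL and headers are placeholders for the elided values in the question):
import time      # the missing import from the question
import requests

url = "https://example.com/api"   # placeholder for the elided URL
headers = {}                      # placeholder for the elided headers

def usagePerUserApi():
    while True:
        resp = requests.get(url, headers=headers, verify=False)
        print(resp.status_code)   # stand-in for the table-building code
        time.sleep(300)           # inside the loop: wait 5 minutes per refresh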
I am trying to download books from "http://www.gutenberg.org/". I want to know why my code gets nothing.
import requests
import re
import os
import urllib.request

def get_response(url):
    response = requests.get(url).text
    return response

def get_content(html):
    reg = re.compile(r'(<span class="mw-headline".*?</span></h2><ul><li>.*</a></li></ul>)', re.S)
    return re.findall(reg, html)

def get_book_url(response):
    reg = r'a href="(.*?)"'
    return re.findall(reg, response)

def get_book_name(response):
    reg = re.compile('>.*</a>')
    return re.findall(reg, response)

def download_book(book_url, path):
    path = ''.join(path.split())
    path = 'F:\\books\\{}.html'.format(path)  # my local file path
    if not os.path.exists(path):
        urllib.request.urlretrieve(book_url, path)
        print('ok!!!')
    else:
        print('no!!!')

def get_url_name(start_url):
    content = get_content(get_response(start_url))
    for i in content:
        book_url = get_book_url(i)
        if book_url:
            book_name = get_book_name(i)
            try:
                download_book(book_url[0], book_name[0])
            except:
                continue

def main():
    get_url_name(start_url)

if __name__ == '__main__':
    start_url = 'http://www.gutenberg.org/wiki/Category:Classics_Bookshelf'
    main()
I have run the code and get nothing, no tracebacks. How can I download the books automatically from the website?
I have run the code and get nothing, no tracebacks.
Well, there's no chance you get a traceback in the case of an exception in download_book(), since you explicitly silence them:
try:
    download_book(book_url[0], book_name[0])
except:
    continue
So the very first thing you want to do is to at least print out errors:
try:
    download_book(book_url[0], book_name[0])
except Exception as e:
    print("while downloading book {}: got error {}".format(book_url[0], e))
    continue
or just don't catch exceptions at all (at least until you know what to expect and how to handle them).
I don't even know how to fix it
Learning how to debug is actually even more important than learning how to write code. For a general introduction, you want to read this first.
For something more python-specific, here are a couple ways to trace your program execution:
1/ add print() calls at the important places to inspect what you really get
2/ import your module in the interactive python shell and test your functions in isolation (this is easier when none of them depend on global variables)
3/ use the builtin step debugger, as sketched below
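For instance, the step debugger in point 3 can be used without restructuring anything (a quick sketch using the standard library's pdb module):
# run the whole script under the debugger from the command line:
#   python -m pdb your_script.py
# or drop a breakpoint at the suspect line inside the code:
import pdb; pdb.set_trace()  # execution pauses here; 'n' steps, 'p expr' prints a value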
Now there are a few obvious issues with your code:
1/ you don't test the result of requests.get() - an HTTP request can fail for quite a few reasons, and the fact you get a response doesn't mean you got the expected response (you could have a 400+ or 500+ response as well).
2/ you use regexps to parse HTML. DON'T - regexps cannot reliably work on HTML; you want a proper HTML parser instead (BeautifulSoup is the canonical solution for web scraping as it's very tolerant). Also some of your regexps look quite wrong (greedy match-all etc). A sketch addressing both points follows below.
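As a hedged illustration of both points, here is a minimal sketch that checks the HTTP status and uses BeautifulSoup instead of regexps (the CSS selector is an assumption for illustration; you would adapt it to the actual page structure):
import requests
from bs4 import BeautifulSoup

def get_links(url):
    response = requests.get(url)
    response.raise_for_status()  # point 1: fail loudly on a 4xx/5xx response
    soup = BeautifulSoup(response.text, 'html.parser')  # point 2: a real HTML parser
    # hypothetical selector: every anchor inside a list item
    return [(a.get('href'), a.get_text()) for a in soup.select('ul li a')]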
start_url is not defined in main()
You need to use a global variable. Otherwise, a better (cleaner) approach is to pass in the variable that you are using. In any case, I would expect an error: start_url is not defined.
def main(start_url):
    get_url_name(start_url)

if __name__ == '__main__':
    start_url = 'http://www.gutenberg.org/wiki/Category:Classics_Bookshelf'
    main(start_url)
EDIT:
Nevermind, the problem is in this line: content = get_content(get_response(start_url))
The regex in get_content() does not seem to match anything. My suggestion would be to use BeautifulSoup: from bs4 import BeautifulSoup. For information on why you shouldn't parse HTML with regex, see this answer: RegEx match open tags except XHTML self-contained tags
Asking regexes to parse arbitrary HTML is like asking a beginner to write an operating system
As others have said, you get no output because your regex doesn't match anything. The text returned by the initial URL has a newline between </h2> and <ul>; try this instead:
r'(<span class="mw-headline".*?</span></h2>\n<ul><li>.*</a></li></ul>)'
When you fix that one, you will face another error; I suggest some debug printouts like this:
def get_url_name(start_url):
    content = get_content(get_response(start_url))
    for i in content:
        print('[DEBUG] Handling:', i)
        book_url = get_book_url(i)
        print('[DEBUG] book_url:', book_url)
        if book_url:
            book_name = get_book_name(i)
            try:
                print('[DEBUG] book_url[0]:', book_url[0])
                print('[DEBUG] book_name[0]:', book_name[0])
                download_book(book_url[0], book_name[0])
            except:
                continue
I've written a script in Python to get some information from a webpage. The code itself runs flawlessly when taken out of asyncio. However, since my script runs synchronously, I wanted to make it go through an asynchronous process so that it accomplishes the task within the shortest possible time, providing optimum performance and obviously not in a blocking manner. As I've never worked with the asyncio library, I'm seriously confused about how to make it go. I've tried to fit my script within the asyncio process, but it doesn't seem right. If somebody stretches out a helping hand to complete this, I would be really grateful. Thanks in advance. Here is my erroneous code:
import requests
from lxml import html
import asyncio

link = "http://quotes.toscrape.com/"

async def quotes_scraper(base_link):
    response = requests.get(base_link)
    tree = html.fromstring(response.text)
    for titles in tree.cssselect("span.tag-item a.tag"):
        processing_docs(base_link + titles.attrib['href'])

async def processing_docs(base_link):
    response = requests.get(base_link).text
    root = html.fromstring(response)
    for soups in root.cssselect("div.quote"):
        quote = soups.cssselect("span.text")[0].text
        author = soups.cssselect("small.author")[0].text
        print(quote, author)
    next_page = root.cssselect("li.next a")[0].attrib['href'] if root.cssselect("li.next a") else ""
    if next_page:
        page_link = link + next_page
        processing_docs(page_link)

loop = asyncio.get_event_loop()
loop.run_until_complete(quotes_scraper(link))
loop.close()
Upon execution what I see in the console is:
RuntimeWarning: coroutine 'processing_docs' was never awaited
processing_docs(base_link + titles.attrib['href'])
You need to call processing_docs() with await.
Replace:
processing_docs(base_link + titles.attrib['href'])
with:
await processing_docs(base_link + titles.attrib['href'])
And replace:
processing_docs(page_link)
with:
await processing_docs(page_link)
Otherwise it tries to run an asynchronous function synchronously and gets upset!
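For reference, here is a minimal sketch of the question's code with both awaits applied (note that requests is still a blocking client, so this fixes the warning but does not make the I/O truly concurrent; an async HTTP client such as aiohttp would be needed for that):
import asyncio
import requests
from lxml import html

link = "http://quotes.toscrape.com/"

async def quotes_scraper(base_link):
    response = requests.get(base_link)
    tree = html.fromstring(response.text)
    for titles in tree.cssselect("span.tag-item a.tag"):
        await processing_docs(base_link + titles.attrib['href'])  # await added

async def processing_docs(base_link):
    root = html.fromstring(requests.get(base_link).text)
    for soups in root.cssselect("div.quote"):
        print(soups.cssselect("span.text")[0].text,
              soups.cssselect("small.author")[0].text)
    next_page = root.cssselect("li.next a")
    if next_page:
        await processing_docs(link + next_page[0].attrib['href'])  # await added

asyncio.get_event_loop().run_until_complete(quotes_scraper(link))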