Multithreading to Scrape Yahoo Finance - python

I'm running a program to pull some info from Yahoo! Finance. It runs fine as a for loop, but it takes a long time (about 10 minutes for 7,000 inputs) because it has to process each requests.get(url) individually (or am I mistaken about the main bottleneck?).
Anyway, I came across multithreading as a potential solution. This is what I have tried:
import requests
import pprint
import threading

with open('MFTop30MinusAFew.txt', 'r') as ins:  # input file for tickers
    for line in ins:
        ticker_array = ins.read().splitlines()

ticker = ticker_array
url_array = []
url_data = []
data_array = []

for i in ticker:
    url = 'https://query2.finance.yahoo.com/v10/finance/quoteSummary/'+i+'?formatted=true&crumb=8ldhetOu7RJ&lang=en-US&region=US&modules=defaultKeyStatistics%2CfinancialData%2CcalendarEvents&corsDomain=finance.yahoo.com'
    url_array.append(url)  # loading each complete url at one time

def fetch_data(url):
    urlHandler = requests.get(url)
    data = urlHandler.json()
    data_array.append(data)
    pprint.pprint(data_array)

threads = [threading.Thread(target=fetch_data, args=(url,)) for url in url_array]
for thread in threads:
    thread.start()
for thread in threads:
    thread.join()

fetch_data(url_array)
The error I get is InvalidSchema: No connection adapters were found for '['https://query2.finance.... [url continues].
PS. I've also read that using a multithreaded approach to scrape websites is bad/can get you blocked. Would Yahoo! Finance mind if I pull data for a couple thousand tickers at once? Nothing happened when I did them sequentially.

If you look carefully at the error you will notice that it doesn't show one URL but all the URLs you appended, enclosed in brackets. Indeed, the last line of your code calls your fetch_data method with the full array as a parameter, which doesn't make sense. If you remove that last line the code runs just fine, and your threads are started as expected.
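For reference, a minimal sketch of the corrected tail of the script; the trailing pprint call is an optional addition, not something from the original code:

# The threading part stays exactly as it was; only the trailing
# fetch_data(url_array) call is removed, since each thread already
# handles a single URL string on its own.
threads = [threading.Thread(target=fetch_data, args=(url,)) for url in url_array]
for thread in threads:
    thread.start()
for thread in threads:
    thread.join()

pprint.pprint(data_array)  # optional: print everything once after all threads finish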

Related

Multithreading hangs when used with requests module and large number of threads

I have about 2000 urls that I am trying to scrape using the requests module. To speed up the process, I am using the ThreadPoolExecutor from concurrent.futures. The execution hangs in the middle when I run this and the issue is inconsistent too. Sometimes, it finishes smoothly within 2 minutes but other times, it just gets stuck at a point for over 30 mins and I eventually have to kill the process.
# scraper.py
import requests

def get_content(url):
    try:
        res = requests.get(url)
        res = res.content
        return res
    except:
        return ""

# main.py
import concurrent.futures
from scraper import get_content

if __name__ == "__main__":
    # content > an empty list for output
    # urls > a list of urls
    with concurrent.futures.ThreadPoolExecutor(max_workers=1000) as executor:
        results = executor.map(get_content, urls)
        for res in results:
            content = content.append(res)
    print(content)
I want to understand how to debug this. Why and where is it getting stuck? And also, why is it inconsistent?
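One very common cause of this kind of hang is requests.get blocking forever on a connection the server never finishes; the code above has no timeout, so a single stuck socket can pin a worker indefinitely, and with 1000 workers the failure point moves around from run to run. A minimal sketch of the same function with a timeout, a narrower except, and a smaller pool (the timeout values and worker count are assumptions to tune):

import concurrent.futures
import requests

def get_content(url):
    try:
        # (connect timeout, read timeout): a stalled socket now raises instead of hanging
        res = requests.get(url, timeout=(5, 30))
        return res.content
    except requests.RequestException as exc:
        print(f"{url} failed: {exc}")  # log the failure so you can see where it stops
        return ""

# urls is the same list of ~2000 URLs from the question
with concurrent.futures.ThreadPoolExecutor(max_workers=50) as executor:
    content = list(executor.map(get_content, urls))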

How to send multiple 'GET' requests using get function of requests library? [duplicate]

This question already has answers here:
What is the fastest way to send 100,000 HTTP requests in Python?
(21 answers)
Closed 2 years ago.
I want to fetch data (JSON files only) from multiple URLs using requests.get(). The URLs are saved in a pandas dataframe column and I am saving the response in JSON files locally.
i = 0
start = time()
for url in pd_url['URL']:
    time_1 = time()
    r_1 = requests.get(url, headers=headers).json()
    filename = './jsons1/' + str(i) + '.json'
    with open(filename, 'w') as f:
        json.dump(r_1, f)
    i += 1
time_taken = time() - start
print('time taken:', time_taken)
Currently, I have written code to get the data one by one from each URL using a for loop, as shown above. However, that code takes too much time to execute. Is there any way to send multiple requests at once and make this run faster?
Also, what are the possible factors that are delaying the responses?
I have an internet connection with low latency and enough speed to 'theoretically' execute the above operation in less than 20 seconds. Still, the above code takes 145-150 seconds every time I run it. My target is to complete this execution in a maximum of 30 seconds. Please suggest workarounds.
It sounds like you want multithreading, so use the ThreadPoolExecutor from the standard library. It can be found in the concurrent.futures package.
import concurrent.futures
import json
import requests

def make_request(url, headers):
    resp = requests.get(url, headers=headers).json()
    return resp

with concurrent.futures.ThreadPoolExecutor(max_workers=5) as executor:
    futures = (executor.submit(make_request, url, headers) for url in pd_url['URL'])
    for idx, future in enumerate(concurrent.futures.as_completed(futures)):
        try:
            data = future.result()
        except Exception as exc:
            print(f"Generated an exception: {exc}")
            continue  # skip writing a file for a failed request
        with open(f"./jsons1/{idx}.json", 'w') as f:
            json.dump(data, f)
You can increase or decrease the number of threads, specified as max_workers, as you see fit.
You can make use of multiple threads to parallelize your fetching. This article presents one possible way of doing that using the ThreadPoolExecutor class from the concurrent.futures module.
It looks like #gold_cy posted pretty much the same answer while I was working on this, but for posterity, here's my example. I've taken your code and modified it to use the executor, and I've modified it slightly to run locally despite not having handy access to a list of JSON URLs.
I'm using a list of 100 URLs, and it takes about 125 seconds to fetch the list serially, and about 27 seconds using 10 workers. I added a timeout on requests to prevent broken servers from holding everything up, and I added some code to handle error responses.
import json
import pandas
import requests
import time
from concurrent.futures import ThreadPoolExecutor

def fetch_url(data):
    index, url = data
    print('fetching', url)
    try:
        r = requests.get(url, timeout=10)
    except requests.exceptions.ConnectTimeout:
        return
    if r.status_code != 200:
        return
    filename = f'./data/{index}.json'
    with open(filename, 'w') as f:
        json.dump(r.text, f)

pd_url = pandas.read_csv('urls.csv')

start = time.time()
with ThreadPoolExecutor(max_workers=10) as runner:
    for _ in runner.map(fetch_url, enumerate(pd_url['URL'])):
        pass
    runner.shutdown()
time_taken = time.time() - start
print('time taken:', time_taken)
Also, what are the possible factors that are delaying the responses?
The response time of the remote server is going to be the major bottleneck.
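To confirm that, it can help to time a single request in isolation and compare it with the serial total; a minimal sketch, reusing the headers and dataframe from the question:

import time
import requests

t0 = time.time()
requests.get(pd_url['URL'].iloc[0], headers=headers, timeout=30)
print('one request took', time.time() - t0, 'seconds')
# Multiply this by the number of URLs: if it roughly matches the 145-150 s
# you measured, the wait is mostly server response time, not your connection.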

Why am I getting an empty array when I run a requests.get and a json a second time in the same code set?

I'm running a program to pull information from the Google Maps API. The API only returns 20 results at a time (60 max), so I have to send an initial request and then a second and third request to get all the data. My issue is that the second request seems to work (I'm getting a response returned), but when I try to parse it using json, I get an empty array. The code works in the first iteration, but doesn't work the second time.
First iteration:
response1 = requests.get(url1)
print(response1)
results1 = response1.json()['results']
print(results1)
jj1 = json.loads(response1.text)
print(jj1)
Second iteration:
if 'next_page_token' in jj1:
    next_page_token1 = jj1['next_page_token']
    url2 = 'https://maps.googleapis.com/maps/api/place/nearbysearch/json?key='+str(api_key)+'&pagetoken='+str(next_page_token1)
    print(url2)
    response2 = requests.get(url2)
    print(response2)
    results2 = response2.json()['results']  # results2 is []
    print(results2)
    jj2 = json.loads(response2.text)
    print(jj2)
I was able to solve the issue by adding a pause in the execution of the code. There are 3 Google Maps API requests in my code. If they process without a pause, I get an error message. Once I added the following module and code, the code worked.
import time
time.sleep(2)
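In context, the pause belongs between receiving the first page and requesting the page pointed to by next_page_token; a minimal sketch, reusing url1 and api_key from the question (the 2-second value is the one that worked above):

import time
import requests

jj1 = requests.get(url1).json()

if 'next_page_token' in jj1:
    time.sleep(2)  # give the page token a moment before it is used
    url2 = ('https://maps.googleapis.com/maps/api/place/nearbysearch/json?key='
            + str(api_key) + '&pagetoken=' + str(jj1['next_page_token']))
    jj2 = requests.get(url2).json()
    print(jj2['results'])  # no longer empty once the pause is in place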

Webscrape multithread python 3

I have been doing a simple web scraping program to learn how to code, and I made it work, but I wanted to see how to make it faster. How could I implement multithreading in this program? All the program does is open the stock symbols file and look up the price for each stock online.
Here is my code
import urllib.request
import urllib
from threading import Thread

symbolsfile = open("Stocklist.txt")
symbolslist = symbolsfile.read()
thesymbolslist = symbolslist.split("\n")

i = 0
while i < len(thesymbolslist):
    theurl = "http://www.google.com/finance/getprices?q=" + thesymbolslist[i] + "&i=10&p=25m&f=c"
    thepage = urllib.request.urlopen(theurl)
    # read the correct character encoding from `Content-Type` request header
    charset_encoding = thepage.info().get_content_charset()
    # apply encoding
    thepage = thepage.read().decode(charset_encoding)
    print(thesymbolslist[i] + " price is " + thepage.split()[len(thepage.split())-1])
    i = i + 1
If you are just applying a function to each item of a list, I recommend multiprocessing.Pool.map(function, list).
https://docs.python.org/3/library/multiprocessing.html?highlight=multiprocessing%20map#multiprocessing.pool.Pool.map
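A minimal sketch of that Pool.map approach applied to the stock-price loop above; the file name and URL format come from the question, and the pool size of 10 is an assumption to tune:

import urllib.request
from multiprocessing import Pool

def fetch_price(symbol):
    theurl = "http://www.google.com/finance/getprices?q=" + symbol + "&i=10&p=25m&f=c"
    thepage = urllib.request.urlopen(theurl)
    charset_encoding = thepage.info().get_content_charset()
    text = thepage.read().decode(charset_encoding)
    return symbol + " price is " + text.split()[-1]

if __name__ == "__main__":
    with open("Stocklist.txt") as symbolsfile:
        thesymbolslist = symbolsfile.read().split("\n")
    with Pool(10) as pool:  # assumption: 10 worker processes
        for line in pool.map(fetch_price, thesymbolslist):
            print(line)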
You need to use asyncio. That's quite a neat package that can also help you with scraping. I created a small snippet of how to integrate with LinkedIn using asyncio, but you can adapt it to your needs quite easily.
import asyncio
import requests

def scrape_first_site():
    url = 'http://example.com/'
    response = requests.get(url)

def scrape_another_site():
    url = 'http://example.com/other/'
    response = requests.get(url)

loop = asyncio.get_event_loop()
tasks = [
    loop.run_in_executor(None, scrape_first_site),
    loop.run_in_executor(None, scrape_another_site)
]
loop.run_until_complete(asyncio.wait(tasks))
loop.close()
Since the default executor is a ThreadPoolExecutor, each task will run in a separate thread. You can use a ProcessPoolExecutor if you'd like to run tasks in processes rather than threads (because of GIL-related issues, for example).
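For example, a minimal sketch of the same two tasks with an explicit ProcessPoolExecutor passed to run_in_executor; this replaces the loop-running part of the snippet above, and the pool size is an assumption:

import concurrent.futures

# On platforms that spawn processes, run this under an
# if __name__ == "__main__": guard.
executor = concurrent.futures.ProcessPoolExecutor(max_workers=4)  # assumption: 4 worker processes
tasks = [
    loop.run_in_executor(executor, scrape_first_site),
    loop.run_in_executor(executor, scrape_another_site),
]
loop.run_until_complete(asyncio.wait(tasks))
executor.shutdown()
loop.close()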

Asynchronously get and store images in python

The following code is a sample of non-asynchronous code; is there any way to get the images asynchronously?
import urllib

for x in range(0, 10):
    urllib.urlretrieve("http://test.com/file %s.png" % (x), "temp/file %s.png" % (x))
I have also seen the grequests library, but I couldn't figure out from the documentation whether that is possible or how to do it.
You don't need any third party library. Just create a thread for every request, start the threads, and then wait for all of them to finish in the background, or continue your application while the images are being downloaded.
import threading
import urllib

results = []

def getter(url, dest):
    results.append(urllib.urlretrieve(url, dest))

threads = []
for x in range(0, 10):
    t = threading.Thread(target=getter, args=('http://test.com/file %s.png' % x,
                                              'temp/file %s.png' % x))
    t.start()
    threads.append(t)

# Wait for all threads to finish.
# You can continue doing whatever you want and
# join the threads when you finally need the results.
# They will fetch your urls in the background without
# blocking your main application.
for t in threads:
    t.join()
Optionally you can create a thread pool that will get urls and dests from a queue.
If you're using Python 3 it's already implemented for you in the futures module.
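A minimal sketch of that pool approach with concurrent.futures; note it uses the Python 3 name urllib.request.urlretrieve for the function the question calls urllib.urlretrieve, and the worker count is an assumption:

from concurrent.futures import ThreadPoolExecutor
import urllib.request

def getter(x):
    # same URL/destination pattern as in the question
    return urllib.request.urlretrieve("http://test.com/file %s.png" % x,
                                      "temp/file %s.png" % x)

with ThreadPoolExecutor(max_workers=10) as pool:  # assumption: 10 download threads
    results = list(pool.map(getter, range(10)))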
Something like this should help you
import grequests

urls = ['url1', 'url2', ....]  # this should be the list of urls
requests = (grequests.get(u) for u in urls)
responses = grequests.map(requests)
for response in responses:
    if 199 < response.status_code < 400:
        name = generate_file_name()  # generate some name for your image file with extension like example.jpg
        with open(name, 'wb') as f:  # or save to S3 or something like that
            f.write(response.content)
Here only the downloading of the images is parallel; writing each image's content to a file is still sequential, so you can create a thread or do something else to make that part parallel or asynchronous as well, for example as sketched below.
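For instance, a minimal sketch that hands each response to a small thread pool so the file writes also overlap; generate_file_name is the same hypothetical helper used above, and the pool size is an assumption:

from concurrent.futures import ThreadPoolExecutor

def save(response):
    if 199 < response.status_code < 400:
        name = generate_file_name()  # hypothetical helper from the snippet above
        with open(name, 'wb') as f:
            f.write(response.content)

with ThreadPoolExecutor(max_workers=4) as pool:  # assumption: 4 writer threads
    list(pool.map(save, responses))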
