I am trying to use a ThreadPoolExecutor to pull data from a server. However, when the executor runs it makes many requests to the server at the same time, but the server can only handle 4 requests at a time. So my task is to make the ThreadPoolExecutor issue no more than 4 requests at any one time. Below is a minimal working example; how can I limit the total number of requests in flight at any given time?
import requests
import concurrent.futures
# Files
files = [r'https://data.pacificclimate.org/data/downscaled_gcms/tasmin_day_BCCAQv2+ANUSPLIN300_CanESM2_historical+rcp45_r1i1p1_19500101-21001231.nc.nc?tasmin[0:55114][152:152][290:290]',
r'https://data.pacificclimate.org/data/downscaled_gcms/tasmin_day_BCCAQv2+ANUSPLIN300_CNRM-CM5_historical+rcp45_r1i1p1_19500101-21001231.nc.nc?tasmin[0:55114][152:152][290:290]',
r'https://data.pacificclimate.org/data/downscaled_gcms/tasmin_day_BCCAQv2+ANUSPLIN300_CSIRO-Mk3-6-0_historical+rcp45_r1i1p1_19500101-21001231.nc.nc?tasmin[0:55114][152:152][290:290]',
r'https://data.pacificclimate.org/data/downscaled_gcms/tasmin_day_BCCAQv2+ANUSPLIN300_CCSM4_historical+rcp45_r2i1p1_19500101-21001231.nc.nc?tasmin[0:55114][152:152][290:290]',
r'https://data.pacificclimate.org/data/downscaled_gcms/tasmin_day_BCCAQv2+ANUSPLIN300_MIROC5_historical+rcp45_r3i1p1_19500101-21001231.nc.nc?tasmin[0:55114][152:152][290:290]',
r'https://data.pacificclimate.org/data/downscaled_gcms/tasmin_day_BCCAQv2+ANUSPLIN300_MPI-ESM-LR_historical+rcp45_r3i1p1_19500101-21001231.nc.nc?tasmin[0:55114][152:152][290:290]',
r'https://data.pacificclimate.org/data/downscaled_gcms/tasmin_day_BCCAQv2+ANUSPLIN300_MRI-CGCM3_historical+rcp45_r1i1p1_19500101-21001231.nc.nc?tasmin[0:55114][152:152][290:290]',
r'https://data.pacificclimate.org/data/downscaled_gcms/tasmin_day_BCCAQv2+ANUSPLIN300_GFDL-ESM2G_historical+rcp45_r1i1p1_19500101-21001231.nc.nc?tasmin[0:55114][152:152][290:290]',
r'https://data.pacificclimate.org/data/downscaled_gcms/tasmin_day_BCCAQv2+ANUSPLIN300_HadGEM2-ES_historical+rcp45_r1i1p1_19500101-21001231.nc.nc?tasmin[0:55114][152:152][290:290]',]
# List of Climate Model Names Corresponding To Files
climate_model = ['CanESM2', 'CNRM-CM5','CSIRO-Mk3-6-0','CCSM4','MIROC5','MPI-ESM-LR','MRI-CGCM3','GFDL-ESM2G','HadGEM2-ES']
def min_function(url, climate_model):
    r = requests.get(url)
    filename = 'tasmin85_' + f'{climate_model}' + '.nc'
    with open(filename, 'wb') as f:
        f.write(r.content)
with concurrent.futures.ThreadPoolExecutor(max_workers=2) as executor:
    executor.map(min_function, files, climate_model)
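For what it's worth, a minimal sketch of one way to cap the concurrency: a ThreadPoolExecutor never runs more tasks at once than max_workers, so setting max_workers=4 keeps at most 4 requests in flight (min_function, files and climate_model are the names from the example above).
with concurrent.futures.ThreadPoolExecutor(max_workers=4) as executor:
    # At most 4 worker threads run at once, so at most 4 GET requests
    # are in flight at any time; the remaining URLs wait in the queue.
    executor.map(min_function, files, climate_model)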
Related
I have a problem when trying to ingest data from an S3 HTTP link into Python using the requests library.
My code is as follows:
import gzip
import requests
def parse(url: str):
    r = requests.get(url, stream=True)
    data = gzip.decompress(r.content)
    raw_data = []
    for line in data.iter_lines():
        raw_data.append(j.loads(line.decode("utf-8")))
    return raw_data
raw_data = parse('https://s3-eu-west-1.amazonaws.com/path/of/bucket.json.gz')
When I run this, the code runs without giving an error but it doesn't end; it looks stuck. The size of the data is 3.1 GB, but I was not expecting it to take this long (I actually waited more than an hour).
What could be the problem? Do you have a suggestion?
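One thing worth noting is that r.content buffers the entire 3.1 GB response in memory before gzip.decompress even starts, and the decompressed bytes have no iter_lines method (nor is j defined). A minimal sketch of a streaming alternative, assuming the file is newline-delimited JSON, decompresses and parses it line by line instead:
import gzip
import json
import requests
def parse(url: str):
    raw_data = []
    with requests.get(url, stream=True) as r:
        r.raise_for_status()
        # Decompress the raw response body on the fly instead of holding
        # all 3.1 GB in memory at once.
        with gzip.GzipFile(fileobj=r.raw) as gz:
            for line in gz:
                if line.strip():
                    raw_data.append(json.loads(line.decode("utf-8")))
    return raw_data
Note that raw_data itself may still be very large; processing each record as it is parsed would keep memory usage flat.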
I have a Python program which executes the steps below:
1. Look for the .sql files present in a particular folder.
2. Create a list with all the .sql file names.
3. Create a database connection.
4. Execute a for loop for each file name in the list created in step 2:
   - Read the .sql file.
   - Execute the query in the .sql file against the database.
   - Export the data to a file.
5. Repeat step 4 for all 15 files.
This works fine and as expected. However, each file is exported serially (one after another). Is there any way I can start exporting all 15 files at the same time?
Yes, you can actually process all 15 files in parallel. Here is an example in which I call a request 4 times with different parameters from a function.
from concurrent.futures import ThreadPoolExecutor
import random, time
from bs4 import BeautifulSoup as bs
import requests
URL = 'http://quotesondesign.com/wp-json/posts'
def quote_stream():
    '''
    Quoter streamer
    '''
    param = dict(page=random.randint(1, 1000))
    quo = requests.get(URL, params=param)
    if quo.ok:
        data = quo.json()
        author = data[0]['title'].strip()
        content = bs(data[0]['content'], 'html5lib').text.strip()
        print(f'{content}\n-{author}\n')
    else:
        print('Connection Issues :(')
def multi_qouter(workers=4):
    with ThreadPoolExecutor(max_workers=workers) as executor:
        _ = [executor.submit(quote_stream) for i in range(workers)]
if __name__ == '__main__':
    now = time.time()
    multi_qouter(workers=4)
    print(f'Time taken {time.time()-now:.2f} seconds')
The point is to create a function that runs one file from start to finish (quote_stream), then call that function with different files in different threads (multi_qouter). For a function that takes parameters like yours, you just write [executor.submit(quote_stream, file) for file in files] and set max_workers=len(files), where files is the list of your .sql files to be passed to that function.
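Applied to your case, a minimal sketch might look like the following; export_one_file is a hypothetical stand-in for your existing read-query-export logic, and sql_files for the list built in step 2:
from concurrent.futures import ThreadPoolExecutor
def export_one_file(sql_file):
    # Hypothetical placeholder: read the .sql file, run its query against
    # the database, and write the result set out to a file.
    ...
sql_files = [...]  # the 15 .sql file names collected in step 2
with ThreadPoolExecutor(max_workers=len(sql_files)) as executor:
    futures = [executor.submit(export_one_file, f) for f in sql_files]
Depending on the database driver, each thread will typically need its own connection or cursor rather than sharing the one created in step 3.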
I have a web application with a REST API, and that application has some video files. Now I am creating an intermediate server for it. I access my web content using the API, but I need to check for updates at regular intervals; if there are new updates, I need to download them.
The files here are video files and I'm using Flask.
I tried the following, but it isn't working:
from flask import Flask, render_template, json, jsonify
import schedule
import time
import requests, json
from pathlib import Path
import multiprocessing
import time
import sys
import schedule, wget, requests, json, os, errno, shutil, time, config
def get_videos():
    response = requests.get('my api here')
    data = response.json()
    files = list()  # collecting my list of video files
    l = len(data)
    for i in range(l):
        files.append(data[i]['filename'])
    return files
def checkfor_file(myfiles):
    for i in range(len(myfiles)):
        url = 'http://website.com/static/Video/' + myfiles[i]  # checking whether the file already exists in my folder
        if url:
            os.remove(url)
        else:
            pass
def get_newfiles(myfiles):
    for i in range(len(myfiles)):
        url = config.videos + myfiles[i]
        filename = wget.download(url)  # downloading video files here
def move_files(myfiles):
    for i in range(len(myfiles)):
        file = myfiles[i]
        shutil.move(config.source_files + file, config.destinatin)  # moving downloaded files to another folder
def videos():
    files = set(get_videos())  # keeping unique files only
    myfiles = list(files)
    checkfor_file(myfiles)
    get_newfiles(myfiles)
    move_files(myfiles)
def job():
    videos()
schedule.every(10).minutes.do(job)  # running every ten minutes
while True:
    schedule.run_pending()
    time.sleep(1)
pi = Flask(__name__)
@pi.route('/')
def index():
    response = requests.get('myapi')
    data = response.json()
    return render_template('main.html', data=data)
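As a side note, the while True: schedule.run_pending() loop above blocks forever before the Flask app is even created. A minimal sketch of one way around this, assuming the videos() function and the pi Flask app defined above, is to run the scheduler in a background thread and let Flask run in the main thread:
import threading
import time
import schedule
def run_scheduler():
    # Check for new videos every ten minutes without blocking Flask.
    schedule.every(10).minutes.do(videos)
    while True:
        schedule.run_pending()
        time.sleep(1)
if __name__ == '__main__':
    threading.Thread(target=run_scheduler, daemon=True).start()
    pi.run()  # pi is the Flask app created above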
I have a file to download (the download path is extracted from JSON, e.g. http://testsite/abc.zip).
I need all 5 threads to download the abc.zip file to the output directory, and the downloads have to be asynchronous or concurrent.
Currently, with the code below, it does download the file 5 times, but it downloads them one by one (synchronously).
What I want is for the downloads to be simultaneous.
def dldr(file=file_url, outputdir=out1):
    local_fn = str(uuid.uuid4())
    if not os.path.exists(outputdir):
        os.makedirs(outputdir)
    s = datetime.now()
    urllib.urlretrieve(file, outputdir + os.sep + local_fn)
    e = datetime.now()
    time_diff = e - s
    logger(out1, local_fn, time_diff)
for i in range(1, 6):
    t = threading.Thread(target=dldr())
    t.start()
I have read the 'Requests with multiple connections' post and it's helpful, but it doesn't address the requirement of the question asked here.
I use the threading module for the download threads, and requests for the download itself, but you can change that to urllib yourself.
import threading
import requests
def download(link, filelocation):
    r = requests.get(link, stream=True)
    with open(filelocation, 'wb') as f:
        for chunk in r.iter_content(1024):
            if chunk:
                f.write(chunk)
def createNewDownloadThread(link, filelocation):
    download_thread = threading.Thread(target=download, args=(link, filelocation))
    download_thread.start()
for i in range(0, 5):
    file = "C:\\test" + str(i) + ".png"
    print(file)
    createNewDownloadThread("http://stackoverflow.com/users/flair/2374517.png", file)
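As a side note on the question's loop: threading.Thread(target=dldr()) calls dldr immediately in the main thread and passes its return value (None) as the target, which is why the downloads ran one after another. A minimal sketch of the corrected loop, reusing the dldr function from the question:
threads = []
for i in range(1, 6):
    # Pass the function itself, without parentheses, so each thread calls it.
    t = threading.Thread(target=dldr)
    t.start()
    threads.append(t)
for t in threads:
    t.join()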
I have a requirement to calculate the total number of POST requests sent to a server. My script uses a thread per JSON file, each of which contains POST data. Below is a rough code snippet.
statistics = 0
def load_from_file(some_arguments, filename):
    data_list = json.loads(open(filename).read())
    url = address + getUrl(filename, config)
    for data in data_list.get("results"):
        statistics += 1
        r = requests.post(url, data=json.dumps(data), headers=headers,
                          auth=HTTPBasicAuth(username, password))
def load_from_directory(some_arguments, directory):
    pool = mp.Pool(mp.cpu_count() * 2)
    func = partial(load_from_file, some_arguments)
    file_list = [f for f in listdir(directory) if isfile(join(directory, f))]
    pool.map(func, [join(directory, f) for f in file_list])
    pool.close()
    pool.join()
    print "total post requests", statistics
I want to print the total number of POST requests processed by this script. Is this the right way to do it?
Sharing memory is not so simple when using multiple processes, and I don't see the need to use the multiprocessing module instead of threading here. Multiprocessing is mostly used as a workaround for the Global Interpreter Lock (GIL).
In your example you are doing I/O-bound operations, which will probably never saturate the CPU. If you insist on using multiprocessing instead of threading, I suggest taking a look at exchanging objects between processes.
Otherwise, using threading, you can share the global statistics variable between threads:
import threading
statistics = 0
def load_from_file(some_arguments, filename):
    global statistics
    data_list = json.loads(open(filename).read())
    url = address + getUrl(filename, config)
    for data in data_list.get("results"):
        statistics += 1
        r = requests.post(url, data=json.dumps(data), headers=headers,
                          auth=HTTPBasicAuth(username, password))
def load_from_directory(some_arguments, directory):
    threads = []
    func = partial(load_from_file, some_arguments)
    file_list = [f for f in listdir(directory) if isfile(join(directory, f))]
    for f in file_list:
        t = threading.Thread(target=func, args=(join(directory, f),))
        t.start()
        threads.append(t)
    # Wait for threads to finish
    for thread in threads:
        thread.join()
    print "total post requests", statistics
Note: this currently spawns one thread per file in the directory, all at once. You might want to implement some kind of throttling for optimal performance; see the sketch below.
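A minimal sketch of such throttling, assuming the same load_from_file as above: a ThreadPoolExecutor caps the number of threads working at the same time, and a threading.Lock makes the shared counter increments thread-safe (stats_lock, count_request and max_workers are names added here for illustration):
from concurrent.futures import ThreadPoolExecutor
from functools import partial
from os import listdir
from os.path import isfile, join
import threading
statistics = 0
stats_lock = threading.Lock()  # illustrative: protects the shared counter
def count_request():
    # Call this once per POST request inside load_from_file
    # instead of incrementing statistics directly.
    global statistics
    with stats_lock:
        statistics += 1
def load_from_directory(some_arguments, directory, max_workers=8):
    func = partial(load_from_file, some_arguments)
    file_list = [join(directory, f) for f in listdir(directory)
                 if isfile(join(directory, f))]
    # At most max_workers files are processed at the same time.
    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        executor.map(func, file_list)
    print("total post requests", statistics)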