I have some short code that uses a for loop to iterate through a list of image URLs and download/save the files. This code works fine, but the list of URLs is too large and would take forever to finish one download at a time.
My goal is to make this an asynchronous for loop in the hope that it will speed things up greatly, but I just started writing Python to build this and don't know enough to use the asyncio library - I can't build out the iterations through aiter. How can I get this running?
To summarize: I have a for loop and need to make it asynchronous so it can handle multiple iterations simultaneously (setting a limit on the number of concurrent tasks would be awesome too).
import pandas as pd
import requests
import asyncio

df = pd.read_excel(r'filelocation', sheet_name='Sheet2')

for index, row in df.iterrows():
    url = row[0]
    filename = url.split('/')[-1]
    r = requests.get(url, allow_redirects=False)
    open('filelocation' + filename, 'wb').write(r.content)
    r.close()
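One way to approach this (a minimal sketch, not a drop-in replacement for the script above) is to keep pandas for reading the URLs but hand the downloads to asyncio, with a semaphore to cap how many run at once. This assumes the third-party aiohttp library for the HTTP calls, since requests is blocking; the 'filelocation' paths are kept as placeholders from the question.

import asyncio
import aiohttp
import pandas as pd

CONCURRENCY = 20  # cap on simultaneous downloads; tune to taste

async def download(session, semaphore, url):
    filename = url.split('/')[-1]
    async with semaphore:  # only CONCURRENCY downloads run at once
        async with session.get(url, allow_redirects=False) as resp:
            data = await resp.read()
    with open('filelocation' + filename, 'wb') as f:
        f.write(data)

async def main():
    df = pd.read_excel(r'filelocation', sheet_name='Sheet2')
    urls = [row[0] for _, row in df.iterrows()]
    semaphore = asyncio.Semaphore(CONCURRENCY)
    async with aiohttp.ClientSession() as session:
        await asyncio.gather(*(download(session, semaphore, u) for u in urls))

asyncio.run(main())

If adding aiohttp isn't an option, the same semaphore pattern works by wrapping the blocking requests.get calls in asyncio.to_thread (Python 3.9+) or loop.run_in_executor.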
While parsing with bs4/lxml and looping through my files with ThreadPoolExecutor threads, I am getting really slow results. I have searched the whole internet for faster alternatives. Parsing about 2000 cached files (1.2 MB each) takes about 15 minutes (max_workers=500) with ThreadPoolExecutor. I even tried parsing on Amazon AWS with 64 vCPUs, but the speed stays about the same.
I want to parse about 100k files, which will take hours. Why isn't the parsing speeding up as I add workers? One file takes about 2 seconds. Why doesn't parsing 10 files with max_workers=10 also take about 2 seconds, since the threads run concurrently? OK, maybe 3 seconds would be fine. But the more files there are and the more workers I assign, the longer it takes. It gets to the point of roughly 25 seconds per file instead of the 2 seconds when running a single file/thread. Why?
What can I do to get the desired 2-3 seconds per file while parallelizing?
If not possible, any faster solutions?
My approach to the parsing is the following:
with open('cache/'+filename, 'rb') as f:
    s = BeautifulSoup(f.read(), 'lxml')
    s.whatever()
Any faster way to scrape my cached files?
# the multiprocessor:
from concurrent.futures import ThreadPoolExecutor, as_completed

future_list = []

with ThreadPoolExecutor(max_workers=500) as executor:
    for filename in os.listdir("cache/"):
        if filename.endswith(".html"):
            fNametoString = str(filename).replace('.html', '')
            x = fNametoString.split("_")
            EAN = x[0]
            SKU = x[1]
            future = executor.submit(parser, filename, EAN, SKU)
            future_list.append(future)
        else:
            pass
    for f in as_completed(future_list):
        pass
Try:
from bs4 import BeautifulSoup
from multiprocessing import Pool

def worker(filename):
    with open(filename, "r") as f_in:
        soup = BeautifulSoup(f_in.read(), "html.parser")
        # do some processing here
        return soup.h1.text.strip()

if __name__ == "__main__":
    filenames = ["page1.html", "page2.html", ...]  # you can use the glob module or populate the filenames list another way
    with Pool(4) as pool:  # 4 is the number of processes
        for result in pool.imap_unordered(worker, filenames):
            print(result)
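A note on why processes rather than threads help here: BeautifulSoup/lxml parsing is CPU-bound Python work, so threads in a single process contend for the GIL and extra workers mostly add overhead, while multiprocessing spreads the parsing across cores. As the comment above hints, one way (just a sketch, assuming the cache/ layout from the question) to populate filenames is glob:

import glob

# every cached page, e.g. cache/EAN_SKU.html
filenames = glob.glob("cache/*.html")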
I'm a newbie in Python, and this is my first time working with a REST API in Python. First let me explain what I want to do. I have a CSV file which has the name of a product and some other details; this is data that went missing after a migration. So now my job is to check in the downstream Application 1 whether it contains these products or whether they are missing there too; if they are missing there, I have to dig further back.
So now I have the API of Application 1 (this returns the product name and details if the product exists) and an API for OAuth 2. The latter creates a token for me, and I use that token to access the Application 1 API (it looks like this: https://Applciationname/rest/< productname >); I get < productname > from a list built from the first column of the CSV file. Everything is working fine, but my list has 3000 entries and it takes almost 2 hours to complete.
Is there a faster way to check this? BTW, I'm calling the token API only once. This is what my code looks like:
list = []
# reading the csv and appending to list (using with open and csv reader here)

get_token = requests.get(tokenurl, OAuthdetails)  # similar type of code
token_dict = json.loads(get_token.content.decode())
token = token_dict['access_token']

headers = {
    'Authorization': 'Bearer' + ' ' + str(token)
}

url = "https://Applciationname/rest/"

for element in list:
    full_url = url + element
    api_response = requests.get(full_url, headers=headers)
    recieved_data = json.loads(api_response.content.decode())
    if api_response.status_code == 200 and len(recieved_data) != 0:
        # writing the element value to the "successcall" text file (using with open here)
        with open("successcall.txt", "a") as f:
            f.write(element + "\n")
    else:
        # writing the element value to the "failurecall" text file (using with open here)
        with open("failurecall.txt", "a") as f:
            f.write(element + "\n")
Could you please help me optimize this, so that I can find the product names that are not in App 1 faster?
You could use threading for your for loop. Like so:
import threading

lock = threading.RLock()
thread_list = []

def check_api(element, full_url):
    api_response = requests.get(full_url, headers=headers)
    recieved_data = json.loads(api_response.content.decode())
    if api_response.status_code == 200 and len(recieved_data) != 0:
        # don't forget to take the lock when writing to the file
        with lock:
            with open("successcall.txt", "a") as f:
                f.write(element + "\n")
    else:
        # again, don't forget the lock, like above
        with lock:
            with open("failurecall.txt", "a") as f:
                f.write(element + "\n")

for element in list:
    full_url = url + element
    t = threading.Thread(target=check_api, args=(element, full_url))
    thread_list.append(t)

# start all threads
for thread in thread_list:
    thread.start()

# wait for them all to finish
for thread in thread_list:
    thread.join()
Also, don't write to the same file from multiple threads without a lock, since that can cause problems; that is what the lock above is for.
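One caveat worth adding (my note, not part of the answer above): the loop above starts roughly 3000 threads at once, which can overwhelm both your machine and the API. A sketch of the same idea with a bounded pool, using only the standard library; check_api, list, and url are assumed to be the names from the code above:

from concurrent.futures import ThreadPoolExecutor

# run at most 20 requests at a time instead of one thread per element
with ThreadPoolExecutor(max_workers=20) as executor:
    for element in list:
        executor.submit(check_api, element, url + element)
# leaving the with-block waits for every submitted call to finish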
I wrote this piece of code and it is currently running, scraping a massive amount of data. So far the loop has run 800 times. It will have to run ~16,000 times to grab all of the data.
Generally speaking, how can I optimize web-scraping code, or am I at the mercy of requests.get?
import json
import requests
import pandas as pd
from pandas.io.json import json_normalize

headers = {}
p = {}
a = int(p['page'])

df = pd.DataFrame()

while True:
    p['page'] = str(a)
    try:
        a += 1
        r = requests.get('URL', headers=headers, params=p)
        complete_json = r.json()
        print('success')
        df_data = pd.DataFrame.from_dict(json_normalize(complete_json['explore_vintage']['matches']), orient='columns')
        df = df.append(df_data)
    except:
        break

df.to_excel('output.xlsx', encoding='utf8')
df.to_csv("output.csv")
print(df.head())
There are a couple of optimizations I can see right off the bat.
The first thing you could add here is parallel processing via async requests. The requests library is synchronous, and as you are seeing, it blocks until each page fully processes. There are a number of libraries that the requests project officially recommends. If you go this route, you'll need to define a terminating condition more explicitly rather than relying on a try/except block inside an infinite while loop.
This is all pseudo-code primarily ripped from their examples, but you can see how this might work:
from requests_futures.sessions import FuturesSession
from concurrent.futures import as_completed
import json
import time

def response_hook(resp, *args, **kwargs):
    with open(f'tmp/{time.thread_time_ns()}.json', 'wb') as fp:
        parsed = resp.json()
        fp.write(json.dumps(parsed).encode('utf-8'))

futures_session = FuturesSession()
futures_session.hooks['response'] = response_hook

with futures_session as session:
    futures = [
        session.get(f'https://jsonplaceholder.typicode.com/todos/{i}', hooks={'response': response_hook})
        for i in range(16000)
    ]
    for future in as_completed(futures):
        resp = future.result()
The second thing is the transform step: parsing the data into a dataframe is an obvious bottleneck, and it will keep slowing down as the dataframe grows larger and larger. I don't know the size of these JSON responses, but if you're fetching 16k of them I imagine this would quickly grind to a halt once you've eaten through your memory. If possible, I would recommend decoupling the scraping and transforming operations: save each scraped response into its own, independent JSON file (as in the example above). Once every response is saved separately and the scraping completes, you can loop over all of the saved contents, parse them, then output to Excel and CSV. Note that depending on the size of the JSON files you may still run into memory issues, but at least you won't block the scraping process and can deal with the output processing separately.
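As a rough sketch of that second pass (my illustration, not code from the answer above): once the responses are sitting under tmp/, a separate script can load them and build the dataframe in one go. The 'explore_vintage' and 'matches' keys come from the question's code; pd.json_normalize may need to be pandas.io.json.json_normalize on older pandas versions.

import glob
import json
import pandas as pd

frames = []
for path in glob.glob('tmp/*.json'):
    with open(path, 'r', encoding='utf-8') as fp:
        complete_json = json.load(fp)
    # the same flattening the original loop applied to each response
    frames.append(pd.json_normalize(complete_json['explore_vintage']['matches']))

df = pd.concat(frames, ignore_index=True)
df.to_excel('output.xlsx')
df.to_csv('output.csv')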
I'm trying to download about 500k small CSV files (5 KB-1 MB) from a list of URLs, but it has been taking too long to get this done. With the code below, I am lucky if I get 10k files a day.
I have tried using the multiprocessing package and a pool to download multiple files simultaneously. This seems to be effective for the first few thousand downloads, but eventually the overall speed goes down. I am no expert, but I assume the decreasing speed indicates that the server I am downloading from cannot keep up with this number of requests. Does that make sense?
To be honest, I am quite lost here and was wondering if anyone has advice on how to speed this up.
import urllib2
import pandas as pd
import csv
from multiprocessing import Pool

# import url file
df = pd.read_csv("url_list.csv")

# select only part of the total list to download
list = pd.Series(df[0:10000])

# define a job and set the file name as the id string under urls
def job(url):
    file_name = str("test/" + url[46:61]) + ".csv"
    u = urllib2.urlopen(url)
    f = open(file_name, 'wb')
    f.write(u.read())
    f.close()

# run job
pool = Pool()
url = ["http://" + str(file_path) for file_path in list]
pool.map(job, url)
You are reinventing the wheel!
How about this:
parallel -a urls.file axel
Of course you'll have to install parallel and axel for your distribution.
axel is a multithreaded counterpart to wget.
parallel lets you run tasks in parallel.
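For this to work, urls.file should be a plain text file with one URL per line. A minimal way to produce it (my sketch, assuming the url_list.csv layout from the question, where the first column holds the host/path and 'http://' has to be prefixed):

import pandas as pd

df = pd.read_csv("url_list.csv")

# one fully qualified URL per line, ready for: parallel -a urls.file axel
with open("urls.file", "w") as f:
    for file_path in df.iloc[:, 0]:
        f.write("http://" + str(file_path) + "\n")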
So I have this and it's working:
import requests
import csv

my_list = [item1, item2, item3, item4]

def write_results(file, list_to_iterate):
    with open(file, 'w') as f:
        fields = ("column_item", "column_value")
        wr = csv.DictWriter(f, delimiter=",", fieldnames=fields, lineterminator='\n')
        wr.writeheader()
        for each_item in list_to_iterate:
            try:
                r = requests.get('https://www.somewebsite.com/api/something?this=' + each_item).json()
                value = str(r['value'])
            except:
                value = "none"
            wr.writerow({'column_item': each_item, 'column_value': value})

write_results('spreadsheet.csv', my_list)
I'm basically writing the outputs that I fetch from a JSON output page to a CSV. My function works great and operates exactly as intended. The only drawback is that I'm iterating through a huge list, so I'm having to send a ton of requests. I'm wondering whether it is possible to send my requests in parallel rather than waiting for a response and then sending the next one. The order in which I write to the CSV doesn't even matter, so it's not a problem if I retrieve responses out of sync with my list.
I've tried and looked at all these different methods, but I just cannot get my head around a solid solution. All the examples I've seen are more geared towards web scraping, and they use a totally different set of modules (not requests).
Any help is gladly appreciated. This is driving me nuts. Also, if it's possible, I'd like to use as many native modules as possible.
Note: Python 3.5
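Since requests is already in use and native modules are preferred beyond that, one sketch (my suggestion, not tested against this API) is concurrent.futures.ThreadPoolExecutor, which ships with Python 3.5. The worker does the request/parse step from the original loop, the main thread keeps all the CSV writing, and executor.map hands results back while the requests themselves run concurrently; my_list and the URL are the placeholders from the question.

import csv
from concurrent.futures import ThreadPoolExecutor

import requests

my_list = [item1, item2, item3, item4]

def fetch(each_item):
    # same request/parse logic as the original loop, isolated per item
    try:
        r = requests.get('https://www.somewebsite.com/api/something?this=' + each_item).json()
        value = str(r['value'])
    except:
        value = "none"
    return each_item, value

def write_results(file, list_to_iterate):
    with open(file, 'w') as f:
        fields = ("column_item", "column_value")
        wr = csv.DictWriter(f, delimiter=",", fieldnames=fields, lineterminator='\n')
        wr.writeheader()
        # up to 10 requests in flight at once; only this thread touches the file
        with ThreadPoolExecutor(max_workers=10) as executor:
            for each_item, value in executor.map(fetch, list_to_iterate):
                wr.writerow({'column_item': each_item, 'column_value': value})

write_results('spreadsheet.csv', my_list)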