Python bs4 lxml parsing slow - python

While parsing with bs4/lxml and looping through my files with ThreadPoolExecutor threads, I am getting really slow results. I have searched the whole internet for a faster alternative to this. Parsing about 2000 cached files (1.2 MB each) takes about 15 minutes with ThreadPoolExecutor (max_workers=500). I even tried parsing on Amazon AWS with 64 vCPUs, but the speed stays about the same.
I want to parse about 100k files, which will take hours. Why isn't the parsing speeding up efficiently with more workers? One file takes about 2 seconds. Why doesn't parsing 10 files with max_workers=10 also take about 2 seconds, since the threads run concurrently? OK, maybe 3 seconds would be fine. But it takes longer the more files there are and the more workers I assign to the threads: it gets to about 25 seconds per file, instead of the 2 seconds a single file/thread takes. Why?
What can I do to get the desired 2-3 seconds per file while parsing in parallel?
If that's not possible, are there any faster solutions?
My approach to the parsing is the following:
with open('cache/'+filename, 'rb') as f:
    s = BeautifulSoup(f.read(), 'lxml')
    s.whatever()
Any faster way to scrape my cached files?
# the multiprocessor:
import os
from concurrent.futures import ThreadPoolExecutor, as_completed

future_list = []
with ThreadPoolExecutor(max_workers=500) as executor:
    for filename in os.listdir("cache/"):
        if filename.endswith(".html"):
            fNametoString = str(filename).replace('.html', '')
            x = fNametoString.split("_")
            EAN = x[0]
            SKU = x[1]
            future = executor.submit(parser, filename, EAN, SKU)
            future_list.append(future)
        else:
            pass
    for f in as_completed(future_list):
        pass

Try:
from bs4 import BeautifulSoup
from multiprocessing import Pool

def worker(filename):
    with open(filename, "r") as f_in:
        soup = BeautifulSoup(f_in.read(), "html.parser")
        # do some processing here
        return soup.h1.text.strip()

if __name__ == "__main__":
    filenames = ["page1.html", "page2.html", ...]  # you can use the glob module or populate the filenames list another way
    with Pool(4) as pool:  # 4 is the number of processes
        for result in pool.imap_unordered(worker, filenames):
            print(result)
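The reason the thread pool doesn't scale is that BeautifulSoup parsing is largely CPU-bound, so the threads mostly contend for the GIL; separate processes sidestep that. Below is a rough sketch of how the Pool approach might be wired into the original cache/ layout, assuming the EAN_SKU.html naming from the question; process_file is a hypothetical stand-in for the parser function, and the soup.title lookup is only a placeholder for the real extraction.

import os
from multiprocessing import Pool
from bs4 import BeautifulSoup

def process_file(filename):
    # EAN_SKU.html naming assumed, as in the question
    EAN, SKU = filename.replace('.html', '').split('_')[:2]
    with open(os.path.join('cache', filename), 'rb') as f:
        soup = BeautifulSoup(f.read(), 'lxml')
    # placeholder extraction; replace with the real parsing logic
    return EAN, SKU, soup.title.text if soup.title else None

if __name__ == '__main__':
    files = [f for f in os.listdir('cache/') if f.endswith('.html')]
    with Pool() as pool:  # defaults to one worker per CPU core
        for EAN, SKU, title in pool.imap_unordered(process_file, files, chunksize=16):
            pass  # collect or store the results here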

Related

Async for loop via Asyncio in Python

I have some short code that is built with a for loop to iterate through a list of image URLs and download/save the files. This code works fine, but the list of URLs is too large and would take forever to complete one check at a time.
My goal is to make this an asynchronous for loop in the hope that it will speed things up greatly, but I just started writing Python to build this and don't know enough to utilize the asyncio library - I can't build out the iterations through aiter. How can I get this running?
To summarize: I have a for loop and need to make it asynchronous so it can handle multiple iterations simultaneously (setting a limit on the number of concurrent tasks would be awesome too).
import pandas as pd
import requests
import asyncio

df = pd.read_excel(r'filelocation', sheet_name='Sheet2')

for index, row in df.iterrows():
    url = row[0]
    filename = url.split('/')[-1]
    r = requests.get(url, allow_redirects=False)
    open('filelocation' + filename, 'wb').write(r.content)
    r.close()
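For reference, a minimal sketch of how this loop might look with asyncio; aiohttp is an assumption here, since requests itself cannot be awaited. The semaphore caps the number of simultaneous downloads (10 is arbitrary), and the 'filelocation' placeholders are kept from the question.

import asyncio
import aiohttp
import pandas as pd

async def download(session, semaphore, url):
    filename = url.split('/')[-1]
    async with semaphore:  # limits how many downloads run at once
        async with session.get(url, allow_redirects=False) as resp:
            content = await resp.read()
    with open('filelocation' + filename, 'wb') as f:
        f.write(content)

async def main():
    df = pd.read_excel(r'filelocation', sheet_name='Sheet2')
    urls = [row[0] for _, row in df.iterrows()]
    semaphore = asyncio.Semaphore(10)  # at most 10 requests in flight
    async with aiohttp.ClientSession() as session:
        await asyncio.gather(*(download(session, semaphore, url) for url in urls))

asyncio.run(main())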

How can I optimize a web-scraping code snippet to run faster?

I wrote this piece of code and it is currently running, scraping a massive amount of data. So far the loop has run 800 times. It will have to run ~16,000 times to grab all of the data.
Generally speaking, how can I optimize web-scraping code, or am I at the mercy of requests.get?
import json
import requests
import pandas as pd
from pandas.io.json import json_normalize

headers = {}
p = {}
a = int(p['page'])

df = pd.DataFrame()
while True:
    p['page'] = str(a)
    try:
        a += 1
        r = requests.get('URL', headers=headers, params=p)
        complete_json = r.json()
        print('success')
        df_data = pd.DataFrame.from_dict(json_normalize(complete_json['explore_vintage']['matches']), orient='columns')
        df = df.append(df_data)
    except:
        False

df.to_excel('output.xlsx', encoding='utf8')
df.to_csv("output.csv")
print(df.head)
There are a couple of optimizations I can see right off the bat.
The first thing you could add here is parallel processing via async requests. The requests library is synchronous and as you are seeing – it's going to block until each page fully processes. There are a number of libraries that the requests project officially recommends. If you go this route you'll need to more explicitly define a terminating condition rather than a try/except block inside an infinite while loop.
This is all pseudo-code primarily ripped from their examples, but you can see how this might work:
from requests_futures.sessions import FuturesSession
from concurrent.futures import as_completed
import json
import time

def response_hook(resp, *args, **kwargs):
    with open(f'tmp/{time.thread_time_ns()}.json', 'wb') as fp:
        parsed = resp.json()
        fp.write(json.dumps(parsed).encode('utf-8'))

futures_session = FuturesSession()
futures_session.hooks['response'] = response_hook

with futures_session as session:
    futures = [
        session.get(f'https://jsonplaceholder.typicode.com/todos/{i}', hooks={'response': response_hook}) for i in range(16000)
    ]
    for future in as_completed(futures):
        resp = future.result()
The parsing of the data into a dataframe is an obvious bottleneck, and it will keep slowing down as the dataframe becomes larger and larger. I don't know the size of these JSON responses, but if you're fetching 16k responses I imagine this would quickly grind to a halt once you've eaten through your memory. If possible, I would recommend decoupling the scraping and transforming operations: save all of your scraped data into their own, independent JSON files (as in the example above). Once each response is saved separately and the scraping completes, you can loop over all of the saved contents, parse them, then output to Excel and CSV. Note that depending on the size of the JSON files you may still run into memory issues, but at least you won't block the scraping process and can deal with the output processing separately.
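A rough sketch of that transform step, assuming the responses were saved into tmp/ as in the example above and share the explore_vintage -> matches structure from the question:

import glob
import json
import pandas as pd

frames = []
for path in glob.glob('tmp/*.json'):
    with open(path, 'r', encoding='utf-8') as fp:
        complete_json = json.load(fp)
    frames.append(pd.json_normalize(complete_json['explore_vintage']['matches']))

# one concat at the end instead of appending to the dataframe inside the loop
df = pd.concat(frames, ignore_index=True)
df.to_excel('output.xlsx')
df.to_csv('output.csv')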

Download a large number of multiple files

I have a set of image URLs with an index, and I want to pass them through a downloader that can download multiple files at a time to speed up the process.
I tried to put the file names and URLs into dicts (name and d2 respectively) and then use requests and threading to do that:
def Thread(start, stop):
    for i in range(start, stop):
        url = d2[i]
        r = requests.get(url)
        with open('assayImage/{}'.format(name[i]), 'wb') as f:
            f.write(r.content)

for n in range(0, len(d2), 1500):
    stop = n + 1500 if n + 1500 <= len(d2) else len(d2)
    threading.Thread(target=Thread, args=(n, stop)).start()
However, sometimes the connection times out and that file is not downloaded, and after a while the download speed decreases dramatically. For example, in the first hour I can download 10000 files, but 3 hours later I can only download 8000 files. Each file is small, around 500 KB.
So I want to ask: is there any stable way to download a large number of files? I really appreciate your answer.
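One possible way to make this more stable (a sketch, not a definitive fix): reuse a single requests.Session with urllib3 retries and a timeout, and cap the number of workers so the server isn't overwhelmed. The name and d2 dicts are the ones from the question; the retry settings and worker count are arbitrary assumptions.

import requests
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry
from concurrent.futures import ThreadPoolExecutor

session = requests.Session()
retries = Retry(total=5, backoff_factor=1, status_forcelist=[429, 500, 502, 503, 504])
session.mount('http://', HTTPAdapter(max_retries=retries))
session.mount('https://', HTTPAdapter(max_retries=retries))

def download(i):
    r = session.get(d2[i], timeout=30)  # fail fast instead of hanging
    r.raise_for_status()
    with open('assayImage/{}'.format(name[i]), 'wb') as f:
        f.write(r.content)

# a modest, fixed pool is gentler on the server than hundreds of threads
with ThreadPoolExecutor(max_workers=20) as executor:
    executor.map(download, range(len(d2)))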

Downloading thousands of files using python

I'm trying to download about 500k small CSV files (5 KB-1 MB) from a list of URLs, but it has been taking too long to get this done. With the code below, I am lucky if I get 10k files a day.
I have tried using the multiprocessing package and a pool to download multiple files simultaneously. This seems to be effective for the first few thousand downloads, but eventually the overall speed goes down. I am no expert, but I assume that the decreasing speed indicates that the server I am trying to download from cannot keep up with this number of requests. Does that make sense?
To be honest, I am quite lost here and was wondering if there is any piece of advice on how to speed this up.
import urllib2
import pandas as pd
import csv
from multiprocessing import Pool

# import url file
df = pd.read_csv("url_list.csv")

# select only part of the total list to download
list = pd.Series(df[0:10000])

# define a job and set file name as the id string under urls
def job(url):
    file_name = str("test/" + url[46:61]) + ".csv"
    u = urllib2.urlopen(url)
    f = open(file_name, 'wb')
    f.write(u.read())
    f.close()

# run job
pool = Pool()
url = ["http://" + str(file_path) for file_path in list]
pool.map(job, url)
You are reinventing the wheel!
How about this:
parallel -a urls.file axel
Of course you'll have to install parallel and axel for your distribution.
axel is a multithreaded counterpart to wget.
parallel lets you run tasks in parallel.

Faster parsing with Python

I'm trying to parse data from one web page. This web page allows you (according to robots.txt) to send 2000 requests per minute.
The problem is that everything I tried is too slow, even though this server's responses are quite quick.
from multiprocessing.pool import ThreadPool as Pool
import datetime
import lxml.html as lh
from bs4 import BeautifulSoup
import requests

with open('products.txt') as f:
    lines = f.readlines()

def update(url):
    html = requests.get(url).content  # 3 seconds
    doc = lh.parse(html)              # almost 12 seconds (with commented line below)
    soup = BeautifulSoup(html)        # almost 12 seconds (with commented line above)

pool = Pool(10)
for line in lines[0:100]:
    pool.apply_async(update, args=(line[:-1],))
pool.close()
now = datetime.datetime.now()
pool.join()
print datetime.datetime.now() - now
As I commented in the code - when I just do html = requests.get(url) for 100 URLs, the time is great - under 3 seconds.
The problem is when I want to use a parser - the preprocessing of the HTML costs about 10 seconds or more, which is too much.
What would you recommend to lower the time?
EDIT: I tried to use SoupStrainer - it is slightly faster, but nothing too noticeable - about 9 seconds.
html = requests.get(url).content
product = SoupStrainer('div',{'class': ['shopspr','bottom']})
soup = BeautifulSoup(html,'lxml', parse_only=product)
Depending on what you need to extract from the pages, perhaps you don't need the full DOM. Perhaps you could get away with HTMLParser (html.parser in Python 3). It should be faster.
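For illustration, a tiny sketch of what driving html.parser directly could look like, assuming only the text inside the div.shopspr / div.bottom elements from the question is needed; PriceParser is a made-up name and the class handling is deliberately simplistic.

import requests
from html.parser import HTMLParser

class PriceParser(HTMLParser):
    # collects text inside <div class="shopspr"> / <div class="bottom"> elements
    def __init__(self):
        super().__init__()
        self.capture = False
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        classes = dict(attrs).get('class') or ''
        if tag == 'div' and ('shopspr' in classes or 'bottom' in classes):
            self.capture = True

    def handle_endtag(self, tag):
        if tag == 'div':
            self.capture = False

    def handle_data(self, data):
        if self.capture and data.strip():
            self.chunks.append(data.strip())

def update(url):
    html = requests.get(url).content
    parser = PriceParser()
    parser.feed(html.decode('utf-8', errors='replace'))
    return parser.chunks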
I would decouple getting the pages from parsing the pages, e.g. two pools: one gets the pages and fills a queue, while the other gets pages from the queue and parses them. This would use the available resources slightly better, but it won't be a big speedup. As a side effect, should the server start serving pages with a bigger delay, you could still keep the parsing workers busy with a big queue.
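A minimal sketch of that decoupling, assuming the same products.txt input: a thread pool for the I/O-bound fetching and a process pool for the CPU-bound parsing, with the imap iterator acting as the queue between them. The title extraction is just a placeholder, and the pool sizes are arbitrary.

from multiprocessing.pool import ThreadPool  # I/O-bound: fetching
from multiprocessing import Pool             # CPU-bound: parsing
import requests
import lxml.html as lh

def fetch(url):
    # download only; no parsing in this pool
    return requests.get(url, timeout=10).content

def parse(html):
    # parse only; runs in a separate process, so the GIL is not a bottleneck
    doc = lh.fromstring(html)
    return doc.findtext('.//title')  # placeholder for the real extraction

if __name__ == '__main__':
    with open('products.txt') as f:
        urls = [line.strip() for line in f]

    with ThreadPool(10) as fetch_pool, Pool(4) as parse_pool:
        pages = fetch_pool.imap_unordered(fetch, urls)  # fills the "queue" as downloads finish
        for result in parse_pool.imap(parse, pages):    # drains it as pages arrive
            print(result)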
