How can I optimize a web-scraping code snippet to run faster? - python

I wrote this piece of code and it is currently running, scraping a massive amount of data. So far the loop has run 800 times. It will have to run ~16,000 times to grab all of the data.
Generally speaking, how can I optimize web-scraping code, or am I at the mercy of requests.get?
import json
import requests
import pandas as pd
from pandas.io.json import json_normalize

headers = {}
p = {}
a = int(p['page'])

df = pd.DataFrame()

while True:
    p['page'] = str(a)
    try:
        a += 1
        r = requests.get('URL', headers=headers, params=p)
        complete_json = r.json()
        print('success')
        df_data = pd.DataFrame.from_dict(json_normalize(complete_json['explore_vintage']['matches']), orient='columns')
        df = df.append(df_data)
    except:
        False

df.to_excel('output.xlsx', encoding='utf8')
df.to_csv("output.csv")
print(df.head)

There are a couple of optimizations I can see right off the bat.
The first thing you could add here is parallel processing via async requests. The requests library is synchronous, and as you are seeing, it blocks until each page has fully processed. There are a number of libraries for this that the requests project officially recommends. If you go this route you'll need to define an explicit terminating condition rather than relying on a try/except block inside an infinite while loop.
This is all pseudo-code primarily ripped from their examples, but you can see how this might work:
from requests_futures.sessions import FuturesSession
from concurrent.futures import as_completed
import json
import time


def response_hook(resp, *args, **kwargs):
    with open(f'tmp/{time.thread_time_ns()}.json', 'wb') as fp:
        parsed = resp.json()
        fp.write(json.dumps(parsed).encode('utf-8'))


futures_session = FuturesSession()
futures_session.hooks['response'] = response_hook

with futures_session as session:
    futures = [
        session.get(f'https://jsonplaceholder.typicode.com/todos/{i}', hooks={'response': response_hook})
        for i in range(16000)
    ]
    for future in as_completed(futures):
        resp = future.result()
The parsing of the data into a dataframe is an obvious bottleneck, and it will keep slowing down as the dataframe becomes larger and larger. I don't know the size of these JSON responses, but if you're fetching 16k of them I imagine this would quickly grind to a halt once you've eaten through your memory. If possible, I would recommend decoupling the scraping and transforming operations. Save each scraped response into its own, independent JSON file (as in the example above). Once the scraping completes you can then loop over all of the saved contents, parse them, and output to Excel and CSV. Note that while you may still run into memory issues depending on the size of the JSON files, you at least won't block the scraping process and can deal with the output processing separately.
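For example, the transform stage might look something like the sketch below. It assumes each response was written to tmp/*.json by the scraping stage above and has the same 'explore_vintage' / 'matches' structure as in your code; pd.json_normalize is the current name for pandas.io.json.json_normalize, and pd.concat is called once at the end instead of appending inside the loop:

import glob
import json

import pandas as pd

frames = []
for path in glob.glob('tmp/*.json'):
    with open(path, encoding='utf-8') as fp:
        complete_json = json.load(fp)
    # pd.json_normalize replaces the deprecated pandas.io.json.json_normalize
    frames.append(pd.json_normalize(complete_json['explore_vintage']['matches']))

# Concatenate once instead of calling df.append inside the loop, which copies
# the entire dataframe on every iteration.
df = pd.concat(frames, ignore_index=True)
df.to_excel('output.xlsx')
df.to_csv('output.csv')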

Related

How to multiprocess from request to SQL (SQLite)

So I want to make requests to an API, transform the data somewhat, and upload it to an SQLite database. Since this is generally a slow process, especially when the API may take around 5 seconds per request, I'm looking to use multiprocessing, mainly to send at least one new request while processing the previous one. However, I tried some things and found it tricky to do the upload to SQL without losing anything.
Simplified code example:
import sqlite3, requests
import pandas as pd

con = sqlite3.Connection('testDB')
cur = con.cursor()

urllist = [url1, url2, …, urln]

def getData(url):
    r = requests.get(url)
    data = pd.read_json(r.text)
    return data

def uploadData(row, cur, con):
    cur.execute(f"INSERT OR REPLACE INTO test VALUES ({row['valA']}, '{row['valB']}');")
    con.commit()

def dataLoop(url, cur, con):
    data = getData(url)
    data.apply(uploadData, cur=cur, con=con, axis=1)

for url in urllist:
    dataLoop(url, cur, con)
So, conceptually, I would like to know which kind of multiprocessing to use, which modules would fit this and how to make sure not to lose data (making sure to only work on new data once the old data has been fully processed and uploaded).
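One common pattern that fits this shape is to fetch concurrently in a thread pool while keeping all SQLite work on a single thread, since sqlite3 connections cannot be shared across threads by default. A rough sketch, reusing the getData idea from the question, with placeholder URLs and the same two-column test table:

import sqlite3
from concurrent.futures import ThreadPoolExecutor, as_completed

import pandas as pd
import requests

urllist = ['https://example.com/api/1', 'https://example.com/api/2']  # placeholder URLs

def getData(url):
    r = requests.get(url)
    return pd.read_json(r.text)

con = sqlite3.connect('testDB')
cur = con.cursor()

# Fetch concurrently; as_completed yields each dataframe as soon as its request
# finishes, while the inserts stay on the main thread so the sqlite3 connection
# is only ever touched from one thread.
with ThreadPoolExecutor(max_workers=4) as pool:
    futures = {pool.submit(getData, url): url for url in urllist}
    for future in as_completed(futures):
        data = future.result()
        for _, row in data.iterrows():
            cur.execute(
                "INSERT OR REPLACE INTO test VALUES (?, ?);",
                (row['valA'], row['valB']),
            )
        con.commit()  # commit per response, so a failed request loses at most one batch

Committing once per response keeps the "only move on once the previous data is uploaded" guarantee, while later downloads can already be in flight during the upload.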

Async for loop via Asyncio in Python

I have some short code that uses a for loop to iterate through a list of image URLs and download/save the files. This code works fine, but the list of URLs is too large and it would take forever to complete one check at a time.
My goal is to make this an asynchronous for loop in the hope that it will speed things up greatly, but I just started writing Python to build this and don't know enough to utilize the asyncio library; I can't build out the iterations through aiter. How can I get this running?
To summarize: I have a for loop and need to make it asynchronous so it can handle multiple iterations simultaneously (setting a limit on the number of concurrent iterations would be awesome too).
import pandas as pd
import requests
import asyncio

df = pd.read_excel(r'filelocation', sheet_name='Sheet2')

for index, row in df.iterrows():
    url = row[0]
    filename = url.split('/')[-1]
    r = requests.get(url, allow_redirects=False)
    open('filelocation' + filename, 'wb').write(r.content)
    r.close()
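One way to express that loop asynchronously with a concurrency cap is asyncio together with aiohttp (asyncio itself does not make HTTP requests) and a semaphore. A rough sketch, assuming the same spreadsheet layout and keeping the placeholder 'filelocation' paths from the question:

import asyncio

import aiohttp
import pandas as pd

CONCURRENCY = 10  # cap on simultaneous downloads

async def download(session, sem, url, out_dir):
    async with sem:  # at most CONCURRENCY downloads run at once
        async with session.get(url, allow_redirects=False) as resp:
            content = await resp.read()
    filename = url.split('/')[-1]
    with open(out_dir + filename, 'wb') as fp:
        fp.write(content)

async def main():
    df = pd.read_excel(r'filelocation', sheet_name='Sheet2')
    sem = asyncio.Semaphore(CONCURRENCY)
    async with aiohttp.ClientSession() as session:
        tasks = [download(session, sem, row[0], 'filelocation') for _, row in df.iterrows()]
        await asyncio.gather(*tasks)

asyncio.run(main())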

Reading csv files with glob to pass data to a database very slow

I have many csv files and I am trying to pass all the data that they contain into a database. For this reason, I found that I could use the glob library to iterate over all csv files in my folder. Following is the code I used:
import requests as req
import pandas as pd
import glob
import json

endpoint = "testEndpoint"
path = "test/*.csv"

for fname in glob.glob(path):
    print(fname)
    df = pd.read_csv(fname)

    for index, row in df.iterrows():
        # print(row['ID'], row['timestamp'], row['date'], row['time'],
        #       row['vltA'], row['curA'], row['pwrA'], row['rpwrA'], row['frq'])
        print(row['timestamp'])
        testjson = {"data":
                    {"installationid": row['ID'],
                     "active": row['pwrA'],
                     "reactive": row['rpwrA'],
                     "current": row['curA'],
                     "voltage": row['vltA'],
                     "frq": row['frq'],
                     }, "timestamp": row['timestamp']}
        payload = {"payload": [testjson]}
        json_data = json.dumps(payload)

        response = req.post(
            endpoint, data=json_data, headers=headers)
This code seems to work fine in the beginning. However, after some time it starts to become really slow (I noticed this because I print the timestamp as I upload the data) and eventually stops completely. What is the reason for this? Is something I am doing here really inefficient?
I can see 3 possible problems here:
Memory: read_csv is fast, but it loads the content of a full file into memory. If the files are really large, you could exhaust physical memory and start using swap, which has terrible performance.
iterrows: you build a dataframe, a data structure optimized for column-wise access, only to then access it by rows. This is already a bad idea, and iterrows is known to have terrible performance because it builds a Series for each row.
One POST request per row: an HTTP request has its own overhead, and furthermore this means you add rows to the database one at a time. If this is the only interface for your database you may have no other choice, but you should check whether it is possible to prepare a batch of rows and load it as a whole. That often yields a gain of more than an order of magnitude; see the sketch after this answer.
Without more info I can hardly say more, but IMHO the biggest gain is to be found in the database feeding, so in point 3. If nothing can be done on that point, or if further performance gain is required, I would try replacing pandas with the csv module, which is row oriented and has a limited footprint because it only processes one line at a time whatever the file size.
Finally, if it makes sense for your use case, I would try using one thread to read the csv files and feed a queue, and a pool of threads to send requests to the endpoint. That should help absorb the HTTP overhead. But beware: depending on the endpoint implementation, it may not improve much if the database access really is the limiting factor.
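To illustrate points 2 and 3, here is a rough sketch that reads each csv with the csv module and sends one POST per batch of rows. It assumes the endpoint accepts an arbitrary number of entries in its "payload" list, which you would need to confirm:

import csv
import glob
import json

import requests as req

endpoint = "testEndpoint"   # same placeholder endpoint as in the question
headers = {"Content-Type": "application/json"}
BATCH_SIZE = 500            # arbitrary; tune against the endpoint's limits

def to_entry(row):
    # row is a dict keyed by the csv header names from the question;
    # DictReader yields strings, so cast here if the endpoint expects numbers
    return {"data": {"installationid": row['ID'],
                     "active": row['pwrA'],
                     "reactive": row['rpwrA'],
                     "current": row['curA'],
                     "voltage": row['vltA'],
                     "frq": row['frq']},
            "timestamp": row['timestamp']}

for fname in glob.glob("test/*.csv"):
    with open(fname, newline='') as fp:
        reader = csv.DictReader(fp)   # streams one row at a time, no dataframe needed
        batch = []
        for row in reader:
            batch.append(to_entry(row))
            if len(batch) >= BATCH_SIZE:
                req.post(endpoint, data=json.dumps({"payload": batch}), headers=headers)
                batch = []
        if batch:   # flush the last partial batch
            req.post(endpoint, data=json.dumps({"payload": batch}), headers=headers)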

Faster parsing with Python

I'm trying to parse data from one web page. This web page allows you (according to robots.txt) to send 2000 requests per minute.
The problem is that everything I tried is too slow. The response of this server is quite quick.
from multiprocessing.pool import ThreadPool as Pool
import datetime
import lxml.html as lh
from bs4 import BeautifulSoup
import requests

with open('products.txt') as f:
    lines = f.readlines()

def update(url):
    html = requests.get(url).content  # 3 seconds
    doc = lh.parse(html)              # almost 12 seconds (with commented line below)
    soup = BeautifulSoup(html)        # almost 12 seconds (with commented line above)

pool = Pool(10)
for line in lines[0:100]:
    pool.apply_async(update, args=(line[:-1],))
pool.close()

now = datetime.datetime.now()
pool.join()
print datetime.datetime.now() - now
As I noted in the code comments, when I just do html = requests.get(url) for 100 URLs, the time is great: under 3 seconds.
The problem is when I want to use a parser: the preprocessing of the HTML costs about 10 seconds or more, which is too much.
What would you recommend to lower the time?
EDIT: I tried using SoupStrainer. It is slightly faster, but not by much: about 9 seconds.
html = requests.get(url).content
product = SoupStrainer('div', {'class': ['shopspr', 'bottom']})
soup = BeautifulSoup(html, 'lxml', parse_only=product)
Depending on what you need to extract from the pages, perhaps you don't need the full DOM. Perhaps you could get away with HTMLParser (html.parser in Python 3). It should be faster.
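For instance, a minimal html.parser sketch that only collects the text inside divs carrying one of the target classes (mirroring the class names from the EDIT above):

from html.parser import HTMLParser

class DivTextParser(HTMLParser):
    """Collects the text inside <div> tags whose class matches a target set."""

    def __init__(self, target_classes):
        super().__init__()
        self.target_classes = set(target_classes)
        self.depth = 0          # > 0 while inside a matching div
        self.texts = []

    def handle_starttag(self, tag, attrs):
        if tag == 'div':
            classes = set((dict(attrs).get('class') or '').split())
            if self.depth or classes & self.target_classes:
                self.depth += 1

    def handle_endtag(self, tag):
        if tag == 'div' and self.depth:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth:
            self.texts.append(data.strip())

parser = DivTextParser({'shopspr', 'bottom'})
parser.feed('<div class="shopspr">price: 42</div>')  # placeholder HTML
print(parser.texts)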
I would decouple getting the pages from parsing them, e.g. two pools, one fetching the pages and filling a queue, the other taking pages from the queue and parsing them. This would use the available resources slightly better, but it won't be a big speedup. As a side effect, should the server start serving pages with a bigger delay, you could still keep the parsing workers busy from a full queue.
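A rough sketch of that producer/consumer split, assuming the products.txt list of URLs from the question and lxml for the parsing:

import queue
from multiprocessing.pool import ThreadPool

import lxml.html
import requests

with open('products.txt') as f:
    urls = [line.strip() for line in f]

page_queue = queue.Queue()
SENTINEL = None  # signals the parsing workers to stop

def fetch(url):
    # Producer: download the page and hand it off to the parsers.
    page_queue.put((url, requests.get(url).content))

def parse_worker():
    # Consumer: pull pages off the queue and parse until the sentinel arrives.
    results = []
    while True:
        item = page_queue.get()
        if item is SENTINEL:
            break
        url, html = item
        doc = lxml.html.fromstring(html)
        results.append((url, doc.findtext('.//title')))
    return results

N_PARSERS = 4
parse_pool = ThreadPool(N_PARSERS)
parse_results = [parse_pool.apply_async(parse_worker) for _ in range(N_PARSERS)]

fetch_pool = ThreadPool(10)
fetch_pool.map(fetch, urls[:100])
fetch_pool.close()
fetch_pool.join()

for _ in range(N_PARSERS):
    page_queue.put(SENTINEL)  # one sentinel per parsing worker
parse_pool.close()
parse_pool.join()

parsed = [pair for r in parse_results for pair in r.get()]

If the parsing itself ever becomes the bottleneck under the GIL, the same split works with a multiprocessing.Pool and a multiprocessing.Queue on the parsing side.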

Multithreaded screen scraping help needed

I'm relatively new to Python, and I'm working through a screen-scraping application that gathers data from multiple financial sites. I have four procedures for now. Two run in just a couple of minutes, and the other two take hours each. These two look up information on particular stock symbols that I have in a csv file. There are 4,000+ symbols that I'm using. I know enough to know that the vast majority of the time is spent on IO over the wire. It's essential that I get these down to half an hour each (or better; is that too ambitious?) for this to be of any practical use to me. I'm using Python 3 and BeautifulSoup.
I have the general structure of what I'm doing below. I've abbreviated the conceptually non-essential sections. I've been reading many threads on making multiple calls/threads at once to speed things up, and it seems like there are a lot of options. Can anyone point me in the right direction, based on the structure of what I have so far? It'd be a huge help. I'm sure it's obvious, but this procedure gets called along with the other data-download procs in a main driver module. Thanks in advance...
from bs4 import BeautifulSoup
import misc modules

class StockOption:
    def __init__(self, DateDownloaded, OptionData):
        self.DateDownloaded = DateDownloaded
        self.OptionData = OptionData

    def ForCsv(self):
        return [self.DateDownloaded, self.Optiondata]

def extract_options(TableRowsFromBeautifulSoup):
    optionsList = []
    for opt in range(0, len(TableRowsFromBeautifulSoup))
        optionsList.append(StockOption(data parsed from TableRows arg))
    return optionsList

def run_proc():
    symbolList = read in csv file of tickers
    for symb in symbolList:
        webStr = #write the connection string
        try:
            with urllib.request.urlopen(webStr) as url: page = url.read()
            soup = BeautifulSoup(page)
            if soup.text.find('There are no All Markets results for') == -1:
                tbls = soup.findAll('table')
                if len(tbls[9]) > 1:
                    expStrings = soup.findAll('td', text=True, attrs={'align': 'right'})[0].contents[0].split()
                    expDate = datetime.date(int(expStrings[6]), int(currMonth), int(expStrings[5].replace(',', '')))
                    calls = extract_options(tbls[9], symb, 'Call', expDate)
                    puts = extract_options(tbls[13], symb, 'Put', expDate)
                    optionsRows = optionsRows + calls
                    optionsRows = optionsRows + puts
        except urllib.error.HTTPError as err:
            if err.code == 404:
                pass
            else:
                raise

    opts = [0] * (len(optionsRows))
    for option in range(0, len(optionsRows)):
        opts[option] = optionsRows[option].ForCsv()

    # Write to the csv file.
    with open('C:/OptionsChains.csv', 'a', newline='') as fp:
        a = csv.writer(fp, delimiter=',')
        a.writerows(opts)

if __name__ == '__main__':
    run_proc()
There are some mistakes in the abbreviated code you have given, so it is a little hard to follow. If you could show more code and check it, it would be easier to understand your problem.
From the code and problem description, I have some advice to share with you:
The run_proc() function reads a webpage for every symbol. If the URLs are the same or some URLs are repeated, how about reading those webpages only once, caching them in memory or on disk, and then analyzing the cached contents for every symbol? It will save the repeated downloads.
BeautifulSoup makes the code easy to write, but it is a little slow. If lxml can do your work, it will save a lot of time on analyzing the webpage contents (see the sketch after this answer).
Hope it will help.
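As an illustration of the lxml suggestion, a minimal sketch; the URL and table index are placeholders mirroring the structure in the question:

import urllib.request

import lxml.html

# Placeholder URL; in the question this would be the per-symbol connection string.
page = urllib.request.urlopen('https://example.com/options?symbol=AAPL').read()

doc = lxml.html.fromstring(page)
if 'There are no All Markets results for' not in doc.text_content():
    tables = doc.findall('.//table')
    # XPath keeps the row extraction in C code instead of Python-level loops.
    rows = tables[9].xpath('.//tr') if len(tables) > 9 else []
    for row in rows:
        cells = [td.text_content().strip() for td in row.findall('td')]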
I was pointed in the right direction by the following post (thanks to the authors, btw):
How to scrape more efficiently with Urllib2?
