I'm trying to parse data from one web page. This web page allows you (according to robots.txt) to send 2000 requests per minute.
The problem is that everything I tried is too slow. The response of this server is quite quick.
from multiprocessing.pool import ThreadPool as Pool
import datetime
import lxml.html as lh
from bs4 import BeautifulSoup
import requests

with open('products.txt') as f:
    lines = f.readlines()

def update(url):
    html = requests.get(url).content   # ~3 seconds for all 100 urls
    doc = lh.fromstring(html)          # almost 12 seconds (with the line below commented out)
    soup = BeautifulSoup(html)         # almost 12 seconds (with the line above commented out)

pool = Pool(10)
for line in lines[0:100]:
    pool.apply_async(update, args=(line.rstrip('\n'),))
pool.close()
now = datetime.datetime.now()
pool.join()
print(datetime.datetime.now() - now)
As I commented in the code: when I just do html = requests.get(url) for 100 urls, the total time is great, under 3 seconds. The problem is when I add a parser: preprocessing the html costs about 10 seconds or more, which is too much. What would you recommend to lower the time?
EDIT: I tried to use SoupStrainer. It is slightly faster, but not by much: about 9 seconds.
from bs4 import BeautifulSoup, SoupStrainer

html = requests.get(url).content
product = SoupStrainer('div', {'class': ['shopspr', 'bottom']})
soup = BeautifulSoup(html, 'lxml', parse_only=product)
Depending on what you need to extract from the pages, perhaps you don't need the full DOM. Perhaps you could get away with HTMLParser (html.parser in Python 3). It should be faster.
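For illustration, a minimal sketch of that approach using the standard library's html.parser; the 'shopspr' class name is borrowed from the SoupStrainer edit above, and the exact tag/attribute checks would need to match the real page:

```python
from html.parser import HTMLParser

class PriceParser(HTMLParser):
    """Collects text inside <div class="shopspr"> without building a full DOM."""
    def __init__(self):
        super().__init__()
        self.in_target = False
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) tuples
        if tag == 'div' and ('class', 'shopspr') in attrs:
            self.in_target = True

    def handle_endtag(self, tag):
        if tag == 'div':
            self.in_target = False

    def handle_data(self, data):
        if self.in_target:
            self.chunks.append(data.strip())

parser = PriceParser()
parser.feed('<html><div class="shopspr">19.99</div><div>other</div></html>')
print(parser.chunks)  # ['19.99']
```

Because it is event-driven and never builds a tree, this kind of parser only pays for the pieces you actually look at.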
I would decouple getting the pages from parsing them, e.g. two pools: one fetches the pages and fills a queue, while the other takes pages off the queue and parses them. This uses the available resources slightly better, but it won't be a big speed-up. As a side effect, should the server start serving pages with a bigger delay, you could still keep the parser workers busy with a big queue.
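A rough sketch of that two-pool layout. The network call is simulated with a short sleep so the shape is clear; in the real script fetch would call requests.get and parse_forever would do the actual extraction:

```python
from multiprocessing.pool import ThreadPool
from queue import Queue
import time
import lxml.html as lh

page_queue = Queue()
results = []

def fetch(url):
    # Downloader pool: in the real script this would be
    # requests.get(url, timeout=10).content; simulated here.
    time.sleep(0.01)  # stand-in for network latency
    page_queue.put(b'<html><title>%s</title></html>' % url.encode())

def parse_forever():
    # Parser pool: pulls raw HTML off the queue as it arrives.
    while True:
        html = page_queue.get()
        if html is None:          # sentinel: no more pages
            break
        doc = lh.fromstring(html)
        results.append(doc.findtext('.//title'))

urls = ['page%d' % i for i in range(20)]  # placeholder URLs

parsers = ThreadPool(2)
for _ in range(2):
    parsers.apply_async(parse_forever)

downloaders = ThreadPool(10)
downloaders.map(fetch, urls)
downloaders.close()
downloaders.join()

for _ in range(2):
    page_queue.put(None)          # one sentinel per parser thread
parsers.close()
parsers.join()

print(len(results))  # 20
```

The queue also acts as a buffer: if the server slows down, already-fetched pages keep the parser side busy.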
Related
While parsing with bs4/lxml and looping through my files with ThreadPoolExecutor, I am seeing really slow results. I have searched the whole internet for faster alternatives. Parsing about 2000 cached files (1.2 MB each) takes about 15 minutes (max_workers=500) with ThreadPoolExecutor. I even tried parsing on Amazon AWS with 64 vCPUs, but the speed remains about the same.
I want to parse about 100k files, which will take hours. Why isn't the parsing speeding up under concurrency? One file takes about 2 seconds. Why doesn't parsing 10 files with max_workers=10 also take about 2 seconds, since the threads run concurrently? OK, maybe 3 seconds would be fine. But it takes ages: the more files there are, and the more workers I assign, the worse it gets, up to about 25 seconds per file instead of the 2 seconds for a single file/thread. Why?
What can I do to get the desired 2-3 seconds per file?
If that's not possible, are there any faster solutions?
My approach to the parsing is the following:
with open('cache/' + filename, 'rb') as f:
    s = BeautifulSoup(f.read(), 'lxml')
    s.whatever()
Any faster way to scrape my cached files?
The threading code:
from concurrent.futures import ThreadPoolExecutor, as_completed
import os

future_list = []
with ThreadPoolExecutor(max_workers=500) as executor:
    for filename in os.listdir("cache/"):
        if filename.endswith(".html"):
            fNametoString = str(filename).replace('.html', '')
            x = fNametoString.split("_")
            EAN = x[0]
            SKU = x[1]
            future = executor.submit(parser, filename, EAN, SKU)
            future_list.append(future)
for f in as_completed(future_list):
    pass
Try:
from bs4 import BeautifulSoup
from multiprocessing import Pool

def worker(filename):
    with open(filename, "r") as f_in:
        soup = BeautifulSoup(f_in.read(), "html.parser")
        # do some processing here
        return soup.h1.text.strip()

if __name__ == "__main__":
    filenames = ["page1.html", "page2.html", ...]  # you can use the glob module or populate the filenames list another way
    with Pool(4) as pool:  # 4 is the number of processes
        for result in pool.imap_unordered(worker, filenames):
            print(result)
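To build the filenames list from the cache layout in the question, something like glob can replace the os.listdir loop; a sketch, assuming the same EAN_SKU.html naming scheme:

```python
import glob
import os

def jobs_from_cache(cache_dir="cache"):
    """Yield (path, EAN, SKU) for every cached page named EAN_SKU.html."""
    for path in sorted(glob.glob(os.path.join(cache_dir, "*.html"))):
        stem = os.path.splitext(os.path.basename(path))[0]
        ean, sku = stem.split("_", 1)
        yield path, ean, sku
```

If your parser takes (filename, EAN, SKU), the tuples can then be fed to the pool with pool.starmap.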
I wrote this piece of code and it is currently running, scraping a massive amount of data. So far the loop has run 800 times. It will have to run ~16,000 times to grab all of the data.
Generally speaking, how can I optimize web-scraping code, or am I at the mercy of requests.get?
import json
import requests
import pandas as pd
from pandas.io.json import json_normalize

headers = {}
p = {'page': '1'}  # other query params redacted
a = int(p['page'])
df = pd.DataFrame()

while True:
    p['page'] = str(a)
    try:
        a += 1
        r = requests.get('URL', headers=headers, params=p)
        complete_json = r.json()
        print('success')
        df_data = pd.DataFrame.from_dict(json_normalize(complete_json['explore_vintage']['matches']), orient='columns')
        df = df.append(df_data)
    except:
        break

df.to_excel('output.xlsx', encoding='utf8')
df.to_csv("output.csv")
print(df.head())
There are a couple of optimizations I can see right off the bat.
The first thing you could add here is parallel processing via async requests. The requests library is synchronous, and as you are seeing, it's going to block until each page fully processes. There are a number of async libraries that the requests project officially recommends. If you go this route, you'll need to define a terminating condition more explicitly rather than relying on a try/except block inside an infinite while loop.
This is all pseudo-code primarily ripped from their examples, but you can see how this might work:
from requests_futures.sessions import FuturesSession
from concurrent.futures import as_completed
import json
import time

def response_hook(resp, *args, **kwargs):
    # Save each response to its own file as soon as it arrives.
    with open(f'tmp/{time.thread_time_ns()}.json', 'wb') as fp:
        parsed = resp.json()
        fp.write(json.dumps(parsed).encode('utf-8'))

with FuturesSession() as session:
    session.hooks['response'] = response_hook
    futures = [
        session.get(f'https://jsonplaceholder.typicode.com/todos/{i}')
        for i in range(16000)
    ]
    for future in as_completed(futures):
        resp = future.result()
The parsing of the data into a dataframe is an obvious bottleneck, and it will keep slowing down as the dataframe grows larger. I don't know the size of these JSON responses, but if you're fetching 16k of them I imagine this would quickly grind to a halt once you've eaten through your memory. If possible, I would recommend decoupling the scraping and transforming operations: save each scraped response into its own, independent JSON file (as in the example above). Once the scraping completes, you can then loop over all of the saved contents, parse them, and output to Excel and CSV. Depending on the size of the JSON files you may still run into memory issues in that second step, but at least you won't block the scraping process, and you can deal with the output processing separately.
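A sketch of that second, offline pass, assuming the per-response JSON files from the example above and the same explore_vintage structure; collecting frames in a list and concatenating once avoids the cost of repeatedly appending to a growing DataFrame:

```python
import glob
import json
import pandas as pd

def merge_scraped(json_dir='tmp', out_csv='output.csv'):
    """Parse every saved response and write one combined CSV."""
    frames = []
    for path in sorted(glob.glob(f'{json_dir}/*.json')):
        with open(path, 'r', encoding='utf-8') as fp:
            payload = json.load(fp)
        # json_normalize lives in the top-level pandas namespace in recent versions
        frames.append(pd.json_normalize(payload['explore_vintage']['matches']))
    df = pd.concat(frames, ignore_index=True)
    df.to_csv(out_csv, index=False)
    return df
```

The same frame can then be written to Excel as well, exactly as in the original script.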
I made a script for my friend in Python (I lost the bet) which downloads all of the thumbnail images (about 50 images, each about 20 kB) via the data-thumb_url attributes, which contain the urls.
Can this code break the website or affect it badly (I mean DDoS or something like that)? I used it a few times for 10, 20, 30 images and it works perfectly, and the website works normally too (it is a very popular website, one of the most popular in the world, and nothing there says web scraping is forbidden), but I need to know whether this code is safe.
from PIL import Image
from bs4 import BeautifulSoup
import requests
import os

url = ''  # (here is the url of the website)
response = requests.get(url)
soup = BeautifulSoup(response.text, 'html.parser')
images = soup.find_all('img')

listt = []
for i in images:
    try:
        listt.append(i['data-thumb_url'])
    except KeyError:
        pass

for i in range(len(listt)):
    img = Image.open(requests.get(listt[i], stream=True).raw)
    img.save("image" + str(i) + ".jpg")
I know it's a little bit of a silly question, considering the site gets 80-100 million views per day, and considering the free extensions/websites/programs for downloading images, but I'm new to bs4 and requests in Python, and I'm anxious.
Firstly, make sure the list of URLs is named consistently throughout: referring to it as listt in most places but lista when appending would raise a NameError.
Secondly, no, your code isn't going to break the website. Because you are running Python in a single thread, it will only make one request at a time. If you wanted to be super cautious, you could add a time.sleep inside your last for loop, but that isn't really necessary.
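If you did want the delay, one way is to push the loop into a small helper; fetch here is a placeholder for whatever does the actual download (e.g. a function wrapping requests.get and Image.open from the script above), and the half-second pause is an arbitrary choice:

```python
import time

def download_all(urls, fetch, delay=0.5):
    """Fetch each url in turn, sleeping between requests to go easy on the server."""
    results = []
    for url in urls:
        results.append(fetch(url))
        time.sleep(delay)  # polite pause between consecutive requests
    return results
```

In the script above, fetch could be something like lambda u: Image.open(requests.get(u, stream=True).raw), with each returned image saved afterwards.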
If you are accessing multiple urls, even with the sleep, the site might have other security measures that you could trigger (a prove-you-are-human check, for example). This might cause your script to fail when you try accessing other pages.
Without seeing the site you are hitting and the number of pages, it is hard to say for certain. But Cargo23 is right: as it stands now, you won't be breaking the site anytime soon.
I am working on a larger program that will display the links from a Google Newspaper search and then analyze those links for certain keywords, context, and data. I've gotten everything up to this one part to work, but when I try to iterate through the pages of results I run into a problem. I'm not sure how to do this without an API, which I do not know how to use. I just need to be able to iterate through multiple pages of search results so that I can then apply my analysis to them. There seems to be a simple solution to iterating through the pages of results, but I am not seeing it.
Are there any suggestions on ways to approach this problem? I am somewhat new to Python and have been teaching myself all of these scraping techniques, so I'm not sure if I'm just missing something simple here. I know this may be an issue with Google restricting automated searches, but even pulling in the first 100 or so links would be beneficial. I have seen examples of this for regular Google searches, but not for Google Newspaper searches.
Here is the body of the code. If there are any lines where you have suggestions, that would be helpful. Thanks in advance!
import csv
import requests
from lxml import html

def get_page_tree(url):
    page = requests.get(url=url, verify=False)
    return html.fromstring(page.text)

def find_other_news_sources(initial_url):
    forwarding_identifier = '/url?q='
    google_news_search_tree = get_page_tree(url=initial_url)
    other_news_sources_links = [
        a_link.replace(forwarding_identifier, '').split('&')[0]
        for a_link in google_news_search_tree.xpath('//a/@href')
        if forwarding_identifier in a_link
    ]
    return other_news_sources_links

links = find_other_news_sources("https://www.google.com/search?hl=en&gl=us&tbm=nws&authuser=0&q=ohio+pay-to-play&oq=ohio+pay-to-play&gs_l=news-cc.3..43j43i53.2737.7014.0.7207.16.6.0.10.10.0.64.327.6.6.0...0.0...1ac.1.NAJRCoza0Ro")

with open('textanalysistest.csv', 'wt') as myfile:
    wr = csv.writer(myfile, quoting=csv.QUOTE_ALL)
    for row in links:
        print(row)
I'm looking into building a parser for a site with similar structure to google's (i.e. a bunch of consecutive results pages, each with a table of content of interest).
A combination of the Selenium package (for page-element based site navigation) and BeautifulSoup (for html parsing) seems like it's the weapon of choice for harvesting written content. You may find them useful too, although I have no idea what kinds of defenses google has in place to deter scraping.
A possible implementation for Mozilla Firefox using selenium, beautifulsoup and geckodriver:
from bs4 import BeautifulSoup, SoupStrainer
import codecs
from selenium import webdriver

def first_page(link):
    """Takes a link, and scrapes the desired tags from the html code"""
    driver = webdriver.Firefox(executable_path='C://example/geckodriver.exe')  # Specify the appropriate driver for your browser here
    counter = 1
    driver.get(link)
    html = driver.page_source
    filter_html_table(html)
    counter += 1
    return driver, counter

def nth_page(driver, counter, max_iter):
    """Takes a driver instance, a counter to keep track of iterations, and max_iter for the maximum number of iterations. Looks for a page element matching the current iteration (how you need to program this depends on the html structure of the page you want to scrape), navigates there, and calls scrape_page to scrape."""
    while counter <= max_iter:
        pageLink = driver.find_element_by_link_text(str(counter))  # For other strategies to retrieve elements from a page, see the selenium documentation
        pageLink.click()
        scrape_page(driver)
        counter += 1
    print("Done scraping")

def scrape_page(driver):
    """Takes a driver instance, extracts the html from the current page, and calls a function to extract tags from it"""
    html = driver.page_source  # Get html from page
    filter_html_table(html)    # Call function to extract desired html tags

def filter_html_table(html):
    """Takes a full page of html, filters the desired tags using beautifulsoup, and calls a function to write them to file"""
    only_td_tags = SoupStrainer("td")  # Specify which tags to keep
    filtered = BeautifulSoup(html, "lxml", parse_only=only_td_tags).prettify()  # Specify how to represent content
    write_to_file(filtered)  # Store extracted tags in a local file

def write_to_file(output):
    """Appends the scraped tags to the output file, creating it if it does not exist."""
    fpath = "<path to your output file>"
    f = codecs.open(fpath, 'a')  # mode 'a' creates the file if needed; 'codecs' avoids problems with utf-8 characters in ASCII format
    f.write(output)
    f.close()
return
After this, it is just a matter of calling:
link = "<link to site to scrape>"
driver, n_iter = first_page(link)
nth_page(driver, n_iter, 1000)  # the 1000 lets us scrape 1000 of the result pages
Note that this script assumes that the result pages you are trying to scrape are sequentially numbered, and those numbers can be retrieved from the scraped page's html using 'find_element_by_link_text'. For other strategies to retrieve elements from a page, see the selenium documentation here.
Also, note that you need to install the packages this script depends on, as well as the driver that selenium needs in order to talk to your browser (here geckodriver): download geckodriver, place it in a folder, and then point 'executable_path' at the executable.
If you do end up using these packages, it can help to spread out your server requests using the time package (native to Python) to avoid exceeding the maximum number of requests allowed by the server you are scraping. I didn't end up needing it for my own project, but see the second answer to the original question linked here for an implementation example using the time module in the fourth code block.
I'm relatively new to Python, and I'm working on a screen-scraping application that gathers data from multiple financial sites. I have four procedures for now. Two run in just a couple of minutes, and the other two take hours each. Those two look up information on particular stock symbols that I have in a csv file; there are 4,000+ symbols. I know enough to know that the vast majority of the time is spent in IO over the wire. It's essential that I get these down to half an hour each (or better; is that too ambitious?) for this to be of any practical use to me. I'm using Python 3 and BeautifulSoup.
I have the general structure of what I'm doing below, with conceptually non-essential sections abbreviated. I'm reading many threads on making multiple calls/threads at once to speed things up, and it seems like there are a lot of options. Can anyone point me in the right direction, based on the structure of what I have so far? It'd be a huge help. I'm sure it's obvious, but this procedure gets called along with the other data-download procedures in a main driver module. Thanks in advance...
from bs4 import BeautifulSoup
# import misc modules (csv, datetime, urllib.request, ...)

class StockOption:
    def __init__(self, DateDownloaded, OptionData):
        self.DateDownloaded = DateDownloaded
        self.OptionData = OptionData

    def ForCsv(self):
        return [self.DateDownloaded, self.OptionData]

def extract_options(TableRowsFromBeautifulSoup, symb, callOrPut, expDate):
    optionsList = []
    for opt in range(0, len(TableRowsFromBeautifulSoup)):
        optionsList.append(StockOption(...))  # data parsed from the TableRows arg
    return optionsList

def run_proc():
    symbolList = ...  # read in csv file of tickers
    optionsRows = []
    for symb in symbolList:
        webStr = ...  # build the connection string
        try:
            with urllib.request.urlopen(webStr) as url:
                page = url.read()
            soup = BeautifulSoup(page)
            if soup.text.find('There are no All Markets results for') == -1:
                tbls = soup.findAll('table')
                if len(tbls[9]) > 1:
                    expStrings = soup.findAll('td', text=True, attrs={'align': 'right'})[0].contents[0].split()
                    expDate = datetime.date(int(expStrings[6]), int(currMonth), int(expStrings[5].replace(',', '')))
                    calls = extract_options(tbls[9], symb, 'Call', expDate)
                    puts = extract_options(tbls[13], symb, 'Put', expDate)
                    optionsRows = optionsRows + calls
                    optionsRows = optionsRows + puts
        except urllib.error.HTTPError as err:
            if err.code == 404:
                pass
            else:
                raise
    opts = [0] * len(optionsRows)
    for option in range(0, len(optionsRows)):
        opts[option] = optionsRows[option].ForCsv()
    # Write to the csv file.
    with open('C:/OptionsChains.csv', 'a', newline='') as fp:
        a = csv.writer(fp, delimiter=',')
        a.writerows(opts)

if __name__ == '__main__':
    run_proc()
There are some mistakes in the abbreviated code you have given, so it is a little hard to follow. If you could show and double-check more of the code, it would be easier to understand your problem.
From the code and problem description, I have some advice to share with you:
In the run_proc() function, the code reads a webpage for every symbol. If the urls are the same, or some urls are repeated, how about reading those webpages just once, writing them to memory or disk, and then analyzing the page contents for every symbol? It will save a lot of redundant downloading.
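A sketch of that caching idea; fetch_page and the cache layout are made-up names for illustration:

```python
import hashlib
import os
import urllib.request

def fetch_page(url, cache_dir='cache'):
    """Download a page once; later calls for the same url are served from disk."""
    os.makedirs(cache_dir, exist_ok=True)
    # Stable filename derived from the url (hashlib, unlike hash(), is
    # deterministic across runs).
    fname = os.path.join(cache_dir, hashlib.md5(url.encode()).hexdigest() + '.html')
    if os.path.exists(fname):
        with open(fname, 'rb') as f:
            return f.read()   # cache hit: no network request at all
    page = urllib.request.urlopen(url).read()
    with open(fname, 'wb') as f:
        f.write(page)         # cache miss: store for next time
    return page
```

run_proc() could then call fetch_page(webStr) instead of urlopen directly, and repeated symbols would hit the disk cache.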
BeautifulSoup makes for easy code, but it is a little slow. If lxml can do your work, it will save a lot of time on analyzing webpage contents.
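For illustration, a sketch of the same kind of lookup done with lxml; the XPath expressions mirror the findAll calls in the question and are untested against the real page:

```python
import lxml.html as lh

def parse_chain(page_bytes):
    """Pull the option tables and the right-aligned expiry cell out of a page with lxml."""
    doc = lh.fromstring(page_bytes)
    if 'There are no All Markets results for' in doc.text_content():
        return None
    tables = doc.xpath('//table')                       # analogue of soup.findAll('table')
    exp_cells = doc.xpath('//td[@align="right"]/text()')  # analogue of the right-aligned <td> lookup
    return tables, exp_cells

# Tiny stand-in page, just to show the shape of the result:
sample = b'<html><body><table><tr><td align="right">Jan 17, 2015</td></tr></table></body></html>'
tables, exp_cells = parse_chain(sample)
print(exp_cells[0])  # Jan 17, 2015
```

lxml parses with compiled C code, which is typically much faster than the default BeautifulSoup setup on large pages.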
Hope it will help.
I was pointed in the right direction from the following post (thanks to the authors btw):
How to scrape more efficiently with Urllib2?