Multithreaded screen scraping help needed - python

I'm relatively new to Python, and I'm working on a screen-scraping application that gathers data from multiple financial sites. I have four procedures for now. Two run in just a couple of minutes, and the other two take hours each. Those two look up information on particular stock symbols that I have in a CSV file; there are 4,000+ symbols that I'm using. I know enough to know that the vast majority of the time is spent in IO over the wire. It's essential that I get these down to half an hour each (or better, if that's not too ambitious) for this to be of any practical use to me. I'm using Python 3 and BeautifulSoup.
I have the general structure of what I'm doing below, with the conceptually non-essential sections abbreviated. I've been reading many threads about making multiple calls/threads at once to speed things up, and it seems like there are a lot of options. Can anyone point me in the right direction to pursue, based on the structure of what I have so far? It'd be a huge help. I'm sure it's obvious, but this procedure gets called along with the other data-download procedures in a main driver module. Thanks in advance...
from bs4 import BeautifulSoup
# ... other imports (urllib.request, csv, datetime, etc.) abbreviated ...

class StockOption:
    def __init__(self, DateDownloaded, OptionData):
        self.DateDownloaded = DateDownloaded
        self.OptionData = OptionData

    def ForCsv(self):
        return [self.DateDownloaded, self.OptionData]

def extract_options(TableRowsFromBeautifulSoup, symb, optType, expDate):
    optionsList = []
    for opt in range(0, len(TableRowsFromBeautifulSoup)):
        optionsList.append(StockOption(...))  # build from data parsed out of the table rows
    return optionsList

def run_proc():
    optionsRows = []
    symbolList = ...  # read in csv file of tickers
    for symb in symbolList:
        webStr = ...  # build the connection string for this symbol
        try:
            with urllib.request.urlopen(webStr) as url:
                page = url.read()
            soup = BeautifulSoup(page)
            if soup.text.find('There are no All Markets results for') == -1:
                tbls = soup.findAll('table')
                if len(tbls[9]) > 1:
                    expStrings = soup.findAll('td', text=True, attrs={'align': 'right'})[0].contents[0].split()
                    expDate = datetime.date(int(expStrings[6]), int(currMonth), int(expStrings[5].replace(',', '')))
                    calls = extract_options(tbls[9], symb, 'Call', expDate)
                    puts = extract_options(tbls[13], symb, 'Put', expDate)
                    optionsRows = optionsRows + calls
                    optionsRows = optionsRows + puts
        except urllib.error.HTTPError as err:
            if err.code == 404:
                pass
            else:
                raise

    opts = [0] * len(optionsRows)
    for option in range(0, len(optionsRows)):
        opts[option] = optionsRows[option].ForCsv()

    # Write to the csv file.
    with open('C:/OptionsChains.csv', 'a', newline='') as fp:
        a = csv.writer(fp, delimiter=',')
        a.writerows(opts)

if __name__ == '__main__':
    run_proc()

There are some mistakes in the abbreviated code you have given, so it is a little hard to understand. If you could show more code and check it, it would be easier to understand your problem.
From the code and problem description, I have some advice to share with you:
In the run_proc() function, a web page is read for every symbol. If the URLs are the same, or some URLs are repeated, how about reading those web pages only once, saving them to memory or disk, and then analyzing the page contents for every symbol? That would save a lot of repeated downloading.
BeautifulSoup is easy to write code with, but a little slow in performance. If lxml can do your work, it will save a lot of time analyzing webpage contents.
Hope it will help.
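For illustration, a minimal sketch of the lxml suggestion (the XPath here is only a placeholder; the real expression depends on the page's markup):

import urllib.request
from lxml import html

def fetch_tables(webStr):
    # Download the page once and parse it with lxml instead of BeautifulSoup.
    with urllib.request.urlopen(webStr) as resp:
        page = resp.read()
    tree = html.fromstring(page)
    # Returns every <table> element; index into the result the same way the
    # original code indexes tbls[9] and tbls[13].
    return tree.xpath('//table')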

I was pointed in the right direction by the following post (thanks to the authors, btw):
How to scrape more efficiently with Urllib2?
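For reference, here's a rough sketch of the kind of thread-pool approach that post points toward. fetch_and_parse below is a hypothetical stand-in for the per-symbol work in run_proc(), and the URL is a placeholder:

import concurrent.futures
import urllib.request

def fetch_and_parse(symb):
    # Placeholder URL; substitute the real connection string for the symbol.
    webStr = "https://example.com/options?s=" + symb
    try:
        with urllib.request.urlopen(webStr) as resp:
            page = resp.read()
    except urllib.error.HTTPError as err:
        if err.code == 404:
            return []
        raise
    # ... parse `page` and return a list of StockOption rows ...
    return []

def run_proc_threaded(symbolList, max_workers=20):
    # A thread pool overlaps the network waits; tune max_workers to taste.
    optionsRows = []
    with concurrent.futures.ThreadPoolExecutor(max_workers=max_workers) as pool:
        for rows in pool.map(fetch_and_parse, symbolList):
            optionsRows.extend(rows)
    return optionsRows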

Related

How can I optimize JSON webscrape?

I recently got interested in Bitcoin and the whole blockchain thing. Since every transaction is public by design, I thought it would be interesting to investigate the number of wallets, the size of transactions, and such. But the current block height of Bitcoin is 732,324, which is quite a lot of blocks to walk through one after another. Thus, I want to obtain the hash for each block so I can multi-thread grabbing the transactions.
The blockchain links each block to the next, so if I start at the first block (the genesis block), find the next block in the chain, and so on until the end, I should have what I need. I am quite new to Python, but below is my code for obtaining the hashes and saving them to a file. However, at the current rate it would take 30-40 hours to complete on my machine. Is there a more efficient way to solve the problem?
# imports
from urllib.request import urlopen
from datetime import datetime
import json

# Setting start parameters
genesisBlock = "000000000019d6689c085ae165831e934ff763ae46a2a6c172b3f1b60a8ce26f"
baseurl = "https://blockchain.info/rawblock/"
i = 0  # counter for tracking progress

# Set HASH
blockHASH = genesisBlock

# Open file to save results
filePath = "./blocklist.tsv"
fileObject = open(filePath, 'a')

# Write header, if first line
if i == 0:
    fileObject.write("blockHASH\theight\ttime\tn_tx\n")

# Start walking through each block
while blockHASH != "":
    # Print progress
    if i % 250 == 0:
        print(str(i) + "|" + datetime.now().strftime("%H:%M:%S"))
    # store the response of URL
    url = baseurl + blockHASH
    response = urlopen(url)
    # storing the JSON response in data
    data_json = json.loads(response.read().decode())
    # Write result to file
    fileObject.write(blockHASH + "\t" +
                     str(data_json["height"]) + "\t" +
                     str(data_json["time"]) + "\t" +
                     str(data_json["n_tx"]) + "\t" +
                     "\n")
    # increment counter
    i = i + 1
    # Set new hash
    blockHASH = data_json["next_block"][0]
    if i > 1000: break  # or just let it run until completion

# Close the file
fileObject.close()
While this doesn't comment directly on the efficiency of your approach, using orjson or rapidjson will definitely speed up your results, since they're both quite a bit faster than the standard json library.
Rapidjson can be swapped in as easily as doing import rapidjson as json, whereas with orjson you have to make a couple of changes, as described on its GitHub page, but nothing too hard.
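For reference, a sketch of what the swap might look like in the block-walking loop (assuming both libraries are installed via pip):

# rapidjson is close to a drop-in replacement for the stdlib json module:
import rapidjson as json
data_json = json.loads(response.read().decode())

# orjson parses bytes directly and returns the same Python objects,
# so the .decode() call can be dropped:
import orjson
data_json = orjson.loads(response.read())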

When I run the code, it runs without errors, but the CSV file is not created. Why?

I found a tutorial and I'm trying to run this script; I have not worked with Python before.
tutorial
I've already checked what is happening via logging.debug, verified that it connects to Google, and tried creating the CSV file with other scripts.
from urllib.parse import urlencode, urlparse, parse_qs
from lxml.html import fromstring
from requests import get
import csv

def scrape_run():
    with open('/Users/Work/Desktop/searches.txt') as searches:
        for search in searches:
            userQuery = search
            raw = get("https://www.google.com/search?q=" + userQuery).text
            page = fromstring(raw)
            links = page.cssselect('.r a')
            csvfile = '/Users/Work/Desktop/data.csv'
            for row in links:
                raw_url = row.get('href')
                title = row.text_content()
                if raw_url.startswith("/url?"):
                    url = parse_qs(urlparse(raw_url).query)['q']
                    csvRow = [userQuery, url[0], title]
                    with open(csvfile, 'a') as data:
                        writer = csv.writer(data)
                        writer.writerow(csvRow)
            print(links)

scrape_run()
The TL;DR of this script is that it does three basic functions:
1. Locates and opens your searches.txt file.
2. Uses those keywords and searches the first page of Google for each result.
3. Creates a new CSV file and prints the results (Keyword, URLs, and page titles).
Solved
Google added a captcha because I sent too many requests; it works when I use mobile internet.
First, verify that the links variable is actually populated with data. If it is empty, test the request you are making by itself; maybe it returns something different than you expected.
Other than that, I think you just need to tweak your file handling a little. Here you can find some guidelines regarding file handling in Python:
https://www.guru99.com/reading-and-writing-files-in-python.html
From my perspective, you need to make sure the file gets created first. Start with a script that can simply create a file, then enhance it to write and append to the file. From there I think you are good to continue with your script.
Beyond that, you would probably prefer opening the file only once instead of on each loop iteration; it could mean much faster execution (a minimal sketch follows below).
Let me know if something is not clear.
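A minimal sketch of that last point, opening the search list and the CSV once and reusing a single writer (paths and selectors are the ones from the question, so the captcha issue noted above still applies):

import csv
from urllib.parse import urlparse, parse_qs
from lxml.html import fromstring
from requests import get

def scrape_run():
    csvfile = '/Users/Work/Desktop/data.csv'
    # Open both files once, up front, instead of reopening the CSV per row.
    with open('/Users/Work/Desktop/searches.txt') as searches, \
         open(csvfile, 'a', newline='') as data:
        writer = csv.writer(data)
        for search in searches:
            userQuery = search.strip()
            raw = get("https://www.google.com/search?q=" + userQuery).text
            page = fromstring(raw)
            for row in page.cssselect('.r a'):
                raw_url = row.get('href')
                if raw_url and raw_url.startswith("/url?"):
                    url = parse_qs(urlparse(raw_url).query)['q']
                    writer.writerow([userQuery, url[0], row.text_content()])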

Python: iterating through pages of Google search results

I am working on a larger piece of code that will display the links from the results of a Google Newspaper search and then analyze those links for certain keywords, context, and data. I've gotten everything up to this one part to work, but when I try to iterate through the pages of results I run into a problem. I'm not sure how to do this without an API, which I do not know how to use. I just need to be able to iterate through multiple pages of search results so that I can then apply my analysis to them. It seems like there should be a simple solution to iterating through the pages of results, but I am not seeing it.
Are there any suggestions on ways to approach this problem? I am somewhat new to Python and have been teaching myself all of these scraping techniques, so I'm not sure if I'm just missing something simple here. I know this may be an issue with Google restricting automated searches, but even pulling in the first 100 or so links would be beneficial. I have seen examples of this for regular Google searches, but not for Google Newspaper searches.
Here is the body of the code. If there are any lines where you have suggestions, that would be helpful. Thanks in advance!
import csv
import requests
from lxml import html

def get_page_tree(url):
    page = requests.get(url=url, verify=False)
    return html.fromstring(page.text)

def find_other_news_sources(initial_url):
    forwarding_identifier = '/url?q='
    google_news_search_url = "https://www.google.com/search?hl=en&gl=us&tbm=nws&authuser=0&q=ohio+pay-to-play&oq=ohio+pay-to-play&gs_l=news-cc.3..43j43i53.2737.7014.0.7207.16.6.0.10.10.0.64.327.6.6.0...0.0...1ac.1.NAJRCoza0Ro"
    google_news_search_tree = get_page_tree(url=google_news_search_url)
    other_news_sources_links = [a_link.replace(forwarding_identifier, '').split('&')[0]
                                for a_link in google_news_search_tree.xpath('//a//@href')
                                if forwarding_identifier in a_link]
    return other_news_sources_links

links = find_other_news_sources("https://www.google.com/search?hl=en&gl=us&tbm=nws&authuser=0&q=ohio+pay-to-play&oq=ohio+pay-to-play&gs_l=news-cc.3..43j43i53.2737.7014.0.7207.16.6.0.10.10.0.64.327.6.6.0...0.0...1ac.1.NAJRCoza0Ro")

with open('textanalysistest.csv', 'wt') as myfile:
    wr = csv.writer(myfile, quoting=csv.QUOTE_ALL)
    for row in links:
        print(row)
I'm looking into building a parser for a site with similar structure to google's (i.e. a bunch of consecutive results pages, each with a table of content of interest).
A combination of the Selenium package (for page-element based site navigation) and BeautifulSoup (for html parsing) seems like it's the weapon of choice for harvesting written content. You may find them useful too, although I have no idea what kinds of defenses google has in place to deter scraping.
A possible implementation for Mozilla Firefox using selenium, beautifulsoup and geckodriver:
from bs4 import BeautifulSoup, SoupStrainer
from bs4.diagnose import diagnose
from os.path import isfile
from time import sleep
import codecs
from selenium import webdriver

def first_page(link):
    """Takes a link, and scrapes the desired tags from the html code"""
    driver = webdriver.Firefox(executable_path='C://example/geckodriver.exe')  # Specify the appropriate driver for your browser here
    counter = 1
    driver.get(link)
    html = driver.page_source
    filter_html_table(html)
    counter += 1
    return driver, counter

def nth_page(driver, counter, max_iter):
    """Takes a driver instance, a counter to keep track of iterations, and max_iter for the maximum number of iterations. Looks for a page element matching the current iteration (how you need to program this depends on the html structure of the page you want to scrape), navigates there, and calls scrape_page to scrape."""
    while counter <= max_iter:
        pageLink = driver.find_element_by_link_text(str(counter))  # For other strategies to retrieve elements from a page, see the selenium documentation
        pageLink.click()
        scrape_page(driver)
        counter += 1
    else:
        print("Done scraping")
    return

def scrape_page(driver):
    """Takes a driver instance, extracts html from the current page, and calls function to extract tags from html of total page"""
    html = driver.page_source  # Get html from page
    filter_html_table(html)  # Call function to extract desired html tags
    return

def filter_html_table(html):
    """Takes a full page of html, filters the desired tags using beautifulsoup, calls function to write to file"""
    only_td_tags = SoupStrainer("td")  # Specify which tags to keep
    filtered = BeautifulSoup(html, "lxml", parse_only=only_td_tags).prettify()  # Specify how to represent content
    write_to_file(filtered)  # Function call to store extracted tags in a local file.
    return

def write_to_file(output):
    """Takes the scraped tags, opens a new file if the file does not exist, or appends to existing file, and writes extracted tags to file."""
    fpath = "<path to your output file>"
    if isfile(fpath):
        f = codecs.open(fpath, 'a')  # using 'codecs' to avoid problems with utf-8 characters in ASCII format.
        f.write(output)
        f.close()
    else:
        f = codecs.open(fpath, 'w')  # using 'codecs' to avoid problems with utf-8 characters in ASCII format.
        f.write(output)
        f.close()
    return
After this, it is just a matter of calling:
link = <link to site to scrape>
driver, n_iter = first_page(link)
nth_page(driver, n_iter, 1000) # the 1000 lets us scrape 1000 of the result pages
Note that this script assumes that the result pages you are trying to scrape are sequentially numbered, and those numbers can be retrieved from the scraped page's html using 'find_element_by_link_text'. For other strategies to retrieve elements from a page, see the selenium documentation here.
Also, note that you need to download the packages this depends on, as well as the driver selenium needs to talk to your browser (in this case geckodriver): download geckodriver, place it in a folder, and then point 'executable_path' at the executable.
If you do end up using these packages, it can help to spread out your server requests using the time package (native to python) to avoid exceeding a maximum number of requests allowed to the server off of which you are scraping. I didn't end up needing it for my own project, but see here, second answer to the original question, for an implementation example with the time module used in the fourth code block.
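As a small illustration of that point, pacing the loop in nth_page with a fixed pause would look something like this (the two-second delay is arbitrary):

from time import sleep

while counter <= max_iter:
    pageLink = driver.find_element_by_link_text(str(counter))
    pageLink.click()
    scrape_page(driver)
    sleep(2)  # arbitrary pause so requests to the server are spread out
    counter += 1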
Yeeeeaaaahhh... If someone with higher rep could edit and add some links to beautifulsoup, selenium and time documentations, that would be great, thaaaanks.

never ending for loop in website mapping crawler

I am working on my first Python project. I want to make a crawler that visits a website and extracts all its links (to a depth of 2). It should store the links in two lists that form a one-to-one register correlating source links to the corresponding target links they contain. Then it should create a CSV file with two columns (Target and Source), so I can open it with Gephi to create a graph showing the site's topographic structure.
The code breaks down at the for loop in the code-execution section; it just never stops extracting links... (I've tried it with a fairly small blog, and it just never ends.) What is the problem? How can I solve it?
A few points to consider:
- I'm really new to programming and python so I realize that my code must be really unpythonic. Also, as I have been searching for ways to build the code and solve my problems it is somewhat patchy, sorry. Thanks for your help!
myurl = raw_input("Introduce URL to crawl => ")
Dominios = myurl.split('.')
Dominio = Dominios[1]

# Variables Block 1
Target = []
Source = []
Estructura = [Target, Source]
links = []

# Variables Block 2
csv_columns = ['Target', 'Source']
csv_data_list = Estructura
currentPath = os.getcwd()
csv_file = "crawleo_%s.csv" % Dominio

# Block 1 => Extract links from a page
def page_crawl(seed):
    try:
        for link in re.findall('''href=["'](.[^"']+)["']''', urllib.urlopen(seed).read(), re.I):
            Source.append(seed)
            Target.append(link)
            links.append(link)
    except IOError:
        pass

# Block 2 => Write csv file
def WriteListToCSV(csv_file, csv_columns, csv_data_list):
    try:
        with open(csv_file, 'wb') as csvfile:
            writer = csv.writer(csvfile, dialect='excel', quoting=csv.QUOTE_NONNUMERIC)
            writer.writerow(csv_columns)
            writer.writerows(izip(Target, Source))
    except IOError as (errno, strerror):
        print("I/O error({0}): {1}".format(errno, strerror))
    return

# Block 3 => Code execution
page_crawl(myurl)
seed_links = (links)
for sublink in seed_links:  # Problem is with this loop
    page_crawl(sublink)
seed_sublinks = (links)
## print Estructura  # Line just to check if code was working
#for thirdlinks in seed_sublinks:  # Commented out until prior problems are solved
#    page_crawl(thirdlinks)
WriteListToCSV(csv_file, csv_columns, csv_data_list)
seed_links and links point to the same list. So when you add elements to links inside the page_crawl function, you are also extending the list that the for loop is looping over. What you need to do is clone the list where you create seed_links.
This is because Python variables are references to objects: multiple names can point to the same object.
If you want to see this with your own eyes, try print sublink inside the for loop. You will notice that there are more links printed than you initially put in. You will probably also notice that you are trying to loop over the entire web :-)
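In other words, something like this (a shallow copy of the list is enough):

page_crawl(myurl)
seed_links = list(links)  # clone, so appends to `links` inside page_crawl don't grow this list
for sublink in seed_links:
    page_crawl(sublink)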
I don't see immediately what is wrong. However, there are several remarks about this:
- You work with global variables, which is bad practice. Better to use local variables that are passed back via return.
- Is it possible that a link at the second level refers back to the first level? That would create a cycle in the data, so you need to keep track of what has already been visited to prevent looping forever.
- I would implement this recursively (with the provisions above), because that makes the code simpler, albeit a little more abstract (a rough sketch follows below).
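A rough sketch of what that recursive version could look like, keeping the question's Python 2 style; the visited set guards against cycles and the depth argument caps the recursion (names here are illustrative, not from the original code):

import re
import urllib

def crawl(seed, depth, visited, pairs):
    """Recursively collect (target, source) link pairs up to `depth` levels deep."""
    if depth == 0 or seed in visited:
        return
    visited.add(seed)
    try:
        html = urllib.urlopen(seed).read()
    except IOError:
        return
    for link in re.findall('''href=["'](.[^"']+)["']''', html, re.I):
        pairs.append((link, seed))
        crawl(link, depth - 1, visited, pairs)

pairs = []  # list of (Target, Source) rows, ready for csv.writer.writerows
crawl(myurl, 2, set(), pairs)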

How to improve performance through Python multithreading

I'm new to Python and multithreading, so please bear with me.
I'm writing a script to process domains in a list through Web of Trust, a service that ranks websites from 1-100 on a scale of "trustworthiness", and write them to a CSV. Unfortunately Web of Trust's servers can take quite a while to respond, and processing 100k domains can take hours.
My attempts at multithreading so far have been disappointing -- attempting to modify the script from this answer gave threading errors, I believe because some threads took too long to resolve.
Here's my unmodified script. Can someone help me multithread it, or point me to a good multithreading resource? Thanks in advance.
import urllib
import re

text = open("top100k", "r")
text = text.read()
text = re.split("\n+", text)

out = open('output.csv', 'w')

for element in text:
    try:
        content = urllib.urlopen("http://api.mywot.com/0.4/public_query2?target=" + element)
        content = content.read()
        content = content[content.index('<application name="0" r="'):content.index('" c')]
        content = element + "," + content[25] + content[26] + "\n"
        out.write(content)
    except:
        pass
A quick scan through the WoT API documentation shows that as well as the public_query2 request that you are using, there is a public_query_json request that lets you get the data in batches of up to 100. I would suggest using that before you start flooding their server with lots of requests in parallel.
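If you do still want to run the individual requests in parallel, a thread pool fits this kind of I/O-bound work. Here is a rough sketch using multiprocessing.dummy (a thread-backed Pool available in both Python 2 and 3), keeping the question's urllib call and parsing; the worker count is arbitrary:

import re
import urllib
from multiprocessing.dummy import Pool  # thread pool, despite the module name

def fetch(element):
    """Fetch one domain's rating; returns a CSV line or None on any failure."""
    try:
        content = urllib.urlopen("http://api.mywot.com/0.4/public_query2?target=" + element).read()
        content = content[content.index('<application name="0" r="'):content.index('" c')]
        return element + "," + content[25] + content[26] + "\n"
    except Exception:
        return None

domains = re.split("\n+", open("top100k", "r").read())

pool = Pool(20)  # 20 worker threads; tune to what the server tolerates
with open('output.csv', 'w') as out:
    for line in pool.map(fetch, domains):
        if line is not None:
            out.write(line)
pool.close()
pool.join()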
