I'm new to Python and multithreading, so please bear with me.
I'm writing a script to process domains in a list through Web of Trust, a service that ranks websites from 1-100 on a scale of "trustworthiness", and write them to a CSV. Unfortunately Web of Trust's servers can take quite a while to respond, and processing 100k domains can take hours.
My attempts at multithreading so far have been disappointing -- attempting to modify the script from this answer gave threading errors, I believe because some threads took too long to resolve.
Here's my unmodified script. Can someone help me multithread it, or point me to a good multithreading resource? Thanks in advance.
import urllib
import re
text = open("top100k", "r")
text = text.read()
text = re.split("\n+", text)
out = open('output.csv', 'w')
for element in text:
try:
content = urllib.urlopen("http://api.mywot.com/0.4/public_query2?target=" + element)
content = content.read()
content = content[content.index('<application name="0" r="'):content.index('" c')]
content = element + "," + content[25] + content[26] + "\n"
out.write(content)
except:
pass
A quick scan through the WoT API documentation shows that as well as the public_query2 request that you are using, there is a public_query_json request that lets you get the data in batches of up to 100. I would suggest using that before you start flooding their server with lots of requests in parallel.
Related
Im newbie in Python, this is my first work with REST API in python. First let me explain what i wanted to do. I have a csv file which have name of a product and some other details, these are missing data after migration. So now my job is to check in the downstream application1 if they contain these product or it is missing there too. if it is missing there should dig up back and back.
So Now I have API of Application 1(this would give the productname and details if that exists) and have an API for OAuth 2. This will create me a token and im using that token to access API of Application 1(it would look like this https://Applciationname/rest/< productname >) i get this < productname > from a list which is retrieved from first column of csv file. Everthing is working fine but my list is having 3000 entries it is taking almost 2 hours for me to complete.
Is there any fastest way to check this, BTW im calling token API only once. This is how my code looks like
list=[]
reading csv and appedning to list #using with open and csv reader here
get_token=requests.get(tokenurl,OAuthdetails) #similar type of code
token_dict=json.loads(get_token.content.decode())
token=token_dict['access_token']
headers={
'Authorization': 'Bearer'+' '+str(token)
}
url= https://Applciationname/rest/
for element in list:
full_url=url+element
api_response=requests.get(full_url,headers)
recieved_data=json.loads(api_response.content.decode())
if api_response.status_code=200 and len(recieved_data)!=0:
writing the element value to text file "successcall" text file #using with open here
else:
writing the element value to text file "failurecall" text file #using with open here
Now could you please help me optimizing this, so that ill be finding the product names which are not in APP 1 faster
You could see Threading for your for loop. Like so:
import threading
lock = threading.RLock()
thread_list = []
def check_api(full_url):
api_response=requests.get(full_url,headers)
recieved_data=json.loads(api_response.content.decode())
if api_response.status_code=200 and len(recieved_data)!=0:
# dont forget to add a lock to writing to the file
with lock:
with open("successcall.txt", "a") as f:
f.write(recieved_data)
else:
# again, dont forget to add with lock like the one above
# writing the element value to text file "failurecall" text file #using with open here
for element in list:
full_url = url+element
t = threading.Thread(target=check_api, args=(full_url, ))
thread_list.append(t)
# start all threads
for thread in thread_list:
thread.start()
# wait for them all to finish
for thread in thread_list:
thread.finish()
You should also not write to the same file while using Threads since it might cause some problems unless you use locks
I found a tutorial and I'm trying to run this script, I did not work with python before.
tutorial
I've already seen what is running through logging.debug, checking whether it is connecting to google and trying to create csv file with other scripts
from urllib.parse import urlencode, urlparse, parse_qs
from lxml.html import fromstring
from requests import get
import csv
def scrape_run():
with open('/Users/Work/Desktop/searches.txt') as searches:
for search in searches:
userQuery = search
raw = get("https://www.google.com/search?q=" + userQuery).text
page = fromstring(raw)
links = page.cssselect('.r a')
csvfile = '/Users/Work/Desktop/data.csv'
for row in links:
raw_url = row.get('href')
title = row.text_content()
if raw_url.startswith("/url?"):
url = parse_qs(urlparse(raw_url).query)['q']
csvRow = [userQuery, url[0], title]
with open(csvfile, 'a') as data:
writer = csv.writer(data)
writer.writerow(csvRow)
print(links)
scrape_run()
The TL;DR of this script is that it does three basic functions:
Locates and opens your searches.txt file.
Uses those keywords and searches the first page of Google for each
result.
Creates a new CSV file and prints the results (Keyword, URLs, and
page titles).
Solved
Google add captcha couse i use to many request
its work when i use mobile internet
Assuming the links variable is full and contains data - please verify.
if empty - test the api call itself you are making, maybe it returns something different than you expected.
Other than that - I think you just need to tweak a little bit your file handling.
https://www.guru99.com/reading-and-writing-files-in-python.html
here you can find some guidelines regarding file handling in python.
in my perspective, you need to make sure you create the file first.
start on with a script which is able to just create a file.
after that enhance the script to be able to write and append to the file.
from there on I think you are good to go and continue with you're script.
other than that I think that you would prefer opening the file only once instead of each loop, it could mean much faster execution time.
let me know if something is not clear.
So I have this and it's working:
import requests
import csv
my_list = [item1, item2, item3, item4]
def write_results(file, list_to_iterate):
with open(file, 'w') as f:
fields = ("column_item", "column_value")
wr = csv.DictWriter(f, delimiter=",", fieldnames=fields, lineterminator='\n')
wr.writeheader()
for each_item in list_to_iterate:
try:
r = requests.get('https://www.somewebsite.com/api/something?this='+each_item).json()
value = str(r['value'])
except:
value = "none"
wr.writerow({'column_item': each_item, 'column_value': value})
write_results('spreadsheet.csv', my_list)
I'm basically writing the outputs that I fetch from a JSON output page to a CSV. My function works great and operates exactly as intended. The only drawback is that I'm actually iterating through a huge list, so I'm having to send a ton of requests. I'm wondering if it is possible for me to send my requests in parallel rather than waiting for a response then sending the next one. The order in which I write to the CSV doesn't even matter, so it's not a problem if I retrieve responses out of sync with my list.
I've tried and looked at all these different methods, but I just cannot get my head around a solid solution. All the examples I've seen are more geared towards web scraping and they are using a total different set of modules (not using requests).
Any help is gladly appreciated. This is driving me nuts. Also, if it's possible, I'd like to use as many native modules as possible.
Note: Python 3.5
I'm relatively new to python, and I'm working through a screen- scraping application that gathers data from multiple financial sites. I have four procedures for now. Two run in just a couple minutes, and the other two... hours each. These two look up information on particular stock symbols that I have in a csv file. There are 4,000+ symbols that I'm using. I know enough to know that the vast majority of the time spent is in IO over the wire. It's essential that I get these down to 1/2 hour each (or, better. Is that too ambitious?) for this to be of any practical use to me. I'm using python 3 and BeautifulSoup.
I have the general structure of what I'm doing below. I've abbreviated conceptually non essential sections. I'm reading many threads on multiple calls/ threads at once to speed things up, and it seems like there are a lot of options. Can anyone point me in the right direction that I should pursue, based on the structure of what I have so far? It'd be a huge help. I'm sure it's obvious, but this procedure gets called along with the other data download procs in a main driver module. Thanks in advance...
from bs4 import BeautifulSoup
import misc modules
class StockOption:
def __init__(self, DateDownloaded, OptionData):
self.DateDownloaded = DateDownloaded
self.OptionData = OptionData
def ForCsv(self):
return [self.DateDownloaded, self.Optiondata]
def extract_options(TableRowsFromBeautifulSoup):
optionsList = []
for opt in range(0, len(TableRowsFromBeautifulSoup))
optionsList.append(StockOption(data parsed from TableRows arg))
return optionsList
def run_proc():
symbolList = read in csv file of tickers
for symb in symbolList:
webStr = #write the connection string
try:
with urllib.request.urlopen(webStr) as url: page = url.read()
soup = BeautifulSoup(page)
if soup.text.find('There are no All Markets results for') == -1:
tbls = soup.findAll('table')
if len(tbls[9]) > 1:
expStrings = soup.findAll('td', text=True, attrs={'align': 'right'})[0].contents[0].split()
expDate = datetime.date(int(expStrings[6]), int(currMonth), int(expStrings[5].replace(',', '')))
calls = extract_options(tbls[9], symb, 'Call', expDate)
puts = extract_options(tbls[13], symb, 'Put', expDate)
optionsRows = optionsRows + calls
optionsRows = optionsRows + puts
except urllib.error.HTTPError as err:
if err.code == 404:
pass
else:
raise
opts = [0] * (len(optionsRows))
for option in range(0, len(optionsRows)):
opts[option] = optionsRows[option].ForCsv()
#Write to the csv file.
with open('C:/OptionsChains.csv', 'a', newline='') as fp:
a = csv.writer(fp, delimiter=',')
a.writerows(opts)
if __name__ == '__main__':
run_proc()
There are some mistakes in the abbreviated code you have given, so it is a little hard to understand the code. If you could show more code and check it, it will be easier to understand your problem.
From the code and problem description, I have some advice to share with you:
In run_proc() function, it read webpage for every symbol. If the urls are the same or some urls are repeated, how about read webpages for just one time and write them to memory or hardware, then analyze page contents for every symbol? It will save
BeautifulSoup is easy to write code, but a little slow in performance. If lxml can do your work, it will save a lot time on analyzing webpage contents.
Hope it will help.
I was pointed in the right direction from the following post (thanks to the authors btw):
How to scrape more efficiently with Urllib2?
I am trying to learn python and also create a web utility. One task I am trying to accomplish is creating a single html file which can be run locally but link to everything it needs to look like the original web page. (if you are going to ask why i want this, its because it may act of a part of a utility i am creating, or if not, just for education) So i have two questions, a theoretical one and a practical one:
1) Is this, for visual (as opposed to functional) purposes, possible? Can a html page work offline while linking to everything it needs online? or if their something fundamental about having the html file itself execute on the web server which does not allow this to be possible? How far can I go with it?
2) I have started a python script which de-relativises (made that one up) linked elements on a html page, but I am a noob so most likely I missed some elements or attributes which would also link to outside resources. I have noticed after trying a few pages that the one in the code below does not work properly, their appears to be a .js file which is not linking correctly. (the first of many problems to come) Assuming the answer to my first question was at least a partial yes, can anyone help me fix the code for this website?
Thank you.
Update, I missed the script tag on this, but even after I added it it still does not work correctly.
import lxml
import sys
from lxml import etree
from StringIO import StringIO
from lxml.html import fromstring, tostring
import urllib2
from urlparse import urljoin
site = "www.script-tutorials.com/advance-php-login-system-tutorial/"
output_filename = "output.html"
def download(site):
response = urllib2.urlopen("http://"+site)
html_input = response.read()
return html_input
def derealitivise(site, html_input):
active_html = lxml.html.fromstring(html_input)
for element in tags_to_derealitivise:
for tag in active_html.xpath(str(element+"[#"+"src"+"]")):
tag.attrib["src"] = urljoin("http://"+site, tag.attrib.get("src"))
for tag in active_html.xpath(str(element+"[#"+"href"+"]")):
tag.attrib["href"] = urljoin("http://"+site, tag.attrib.get("href"))
return lxml.html.tostring(active_html)
active_html = ""
tags_to_derealitivise = ("//img", "//a", "//link", "//embed", "//audio", "//video", "//script")
print "downloading..."
active_html = download(site)
active_html = derealitivise(site, active_html)
print "writing file..."
output_file = open (output_filename, "w")
output_file.write(active_html)
output_file.close()
Furthermore, I could make the code more through by checking all of the elements...
It would look kind of like this, but I do not know the exact way to iterate through all of the elements. This is a seperate problem, and I will most likely figure it out by the time anyone responds...:
def derealitivise(site, html_input):
active_html = lxml.html.fromstring(html_input)
for element in active_html.xpath:
for tag in active_html.xpath(str(element+"[#"+"src"+"]")):
tag.attrib["src"] = urljoin("http://"+site, tag.attrib.get("src"))
for tag in active_html.xpath(str(element+"[#"+"href"+"]")):
tag.attrib["href"] = urljoin("http://"+site, tag.attrib.get("href"))
return lxml.html.tostring(active_html)
update
Thanks to Burhan Khalid's solution, which seemed too simple to be viable at first glance, I got it working. The code is so simple most of you will most likely not require it, but I will post it anyway incase it helps:
import lxml
import sys
from lxml import etree
from StringIO import StringIO
from lxml.html import fromstring, tostring
import urllib2
from urlparse import urljoin
site = "www.script-tutorials.com/advance-php-login-system-tutorial/"
output_filename = "output.html"
def download(site):
response = urllib2.urlopen(site)
html_input = response.read()
return html_input
def derealitivise(site, html_input):
active_html = html_input.replace('<head>', '<head> <base href='+site+'>')
return active_html
active_html = ""
print "downloading..."
active_html = download(site)
active_html = derealitivise(site, active_html)
print "writing file..."
output_file = open (output_filename, "w")
output_file.write(active_html)
output_file.close()
Despite all of this, and its great simplicity, the .js object running on the website I have listed in the script still will not load correctly. Does anyone know if this is possible to fix?
while i am trying to make only the html file offline, while using the
linked resources over the web.
This is a two step process:
Copy the HTML file and save it to your local directory.
Add a BASE tag in the HEAD section, and point the href attribute of it to the absolute URL.
Since you want to learn how to do it yourself, I will leave it at that.
#Burhan has an easy answer using <base href="..."> tag in the <head>, and it works as you have found out. I ran the script you posted, and the page downloaded fine. As you noticed, some of the JavaScript now fails. This can be for multiple reasons.
If you are opening the HTML file as a local file:/// URL, the page may not work. Many browsers heavily sandbox local HTML files, not allowing them to perform network requests or examine local files.
The page may perform XmlHTTPRequests or other network operations to the remote site, which will be denied for cross domain scripting reasons. Looking in the JS console, I see the following errors for the script you posted:
XMLHttpRequest cannot load http://www.script-tutorials.com/menus.php?give=menu. Origin http://localhost:8000 is not allowed by Access-Control-Allow-Origin.
Unfortunately, if you do not have control of www.script-tutorials.com, there is no easy way around this.