Python - Twisted: POST in a form
Hi guys!
I'm still discovering Twisted, and I've written this script to parse the content of an HTML table into Excel. The script works well! My question is: how can I do the same thing for a single web page (http://bandscore.ielts.org/), but with many POST requests, so that I can fetch all the results, parse them with BeautifulSoup and then put them into Excel?
Parsing the source and writing it to Excel is fine, but I don't know how to make a POST request with Twisted in order to implement that in my script.
This is the script I use to parse (with Twisted) a lot of different pages; I want to write the same script, but with a lot of different POST payloads sent to the same page rather than a lot of pages:
from twisted.web import client
from twisted.internet import reactor, defer
from bs4 import BeautifulSoup as BeautifulSoup
import time
import xlwt

start = time.time()
wb = xlwt.Workbook(encoding='utf-8')
ws = wb.add_sheet("BULATS_IA_PARSED")
global x
x = 0

Countries_List = ['Afghanistan','Armenia','Brazil','Argentina','Armenia','Australia','Austria','Azerbaijan','Bahrain','Bangladesh','Belgium','Belize','Bolivia','Bosnia and Herzegovina','Brazil','Brunei Darussalam','Bulgaria','Cameroon','Canada','Central African Republic','Chile','China','Colombia','Costa Rica','Croatia','Cuba','Cyprus','Czech Republic','Denmark','Dominican Republic','Ecuador','Egypt','Eritrea','Estonia','Ethiopia','Faroe Islands','Fiji','Finland','France','French Polynesia','Georgia','Germany','Gibraltar','Greece','Grenada','Hong Kong','Hungary','Iceland','India','Indonesia','Iran','Iraq','Ireland','Israel','Italy','Jamaica','Japan','Jordan','Kazakhstan','Kenya','Kuwait','Latvia','Lebanon','Libya','Liechtenstein','Lithuania','Luxembourg','Macau','Macedonia','Malaysia','Maldives','Malta','Mexico','Monaco','Montenegro','Morocco','Mozambique','Myanmar (Burma)','Nepal','Netherlands','New Caledonia','New Zealand','Nigeria','Norway','Oman','Pakistan','Palestine','Papua New Guinea','Paraguay','Peru','Philippines','Poland','Portugal','Qatar','Romania','Russia','Saudi Arabia','Serbia','Singapore','Slovakia','Slovenia','South Africa','South Korea','Spain','Sri Lanka','Sweden','Switzerland','Syria','Taiwan','Thailand','Trinadad and Tobago','Tunisia','Turkey','Ukraine','United Arab Emirates','United Kingdom','United States','Uruguay','Uzbekistan','Venezuela','Vietnam']

urls = ["http://www.cambridgeesol.org/institutions/results.php?region=%s&type=&BULATS=on" % Countries for Countries in Countries_List]

def finish(results):
    global x
    for result in results:
        print 'GOT PAGE', len(result), 'bytes'
        soup = BeautifulSoup(result)
        tableau = soup.findAll('table')
        try:
            rows = tableau[3].findAll('tr')
            print("Fetching")
            for tr in rows:
                cols = tr.findAll('td')
                y = 0
                x = x + 1
                for td in cols:
                    texte_bu = td.text
                    texte_bu = texte_bu.encode('utf-8')
                    #print("Writing...")
                    #print texte_bu
                    ws.write(x, y, td.text)
                    y = y + 1
        except IndexError:
            print("No IA for this country")
            pass
    reactor.stop()

waiting = [client.getPage(url) for url in urls]
defer.gatherResults(waiting).addCallback(finish)
reactor.run()

wb.save("IALOL.xls")
print "Elapsed Time: %s" % (time.time() - start)
Thank you very much in advance for your help!
You have two options: keep using getPage and tell it to POST instead of GET, or switch to Agent.
The API documentation for getPage directs you to the API documentation for HTTPClientFactory to discover additional supported options.
The latter API documentation explicitly covers method and implies (but does a bad job of explaining) postdata. So, to make a POST with getPage:
d = getPage(url, method='POST', postdata="hello, world, or whatever.")
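For the question's use case (many POST payloads against http://bandscore.ielts.org/), the gatherResults pattern from the script above can be kept and only the page fetch changed. This is a minimal sketch, assuming the target form is an ordinary application/x-www-form-urlencoded form; the field name CountryName is hypothetical and has to be read from the real page's <form>, and the parsing inside finish() would of course need adapting to the new page's layout. These lines would replace the final getPage loop of the original script:

import urllib
from twisted.web.client import getPage
from twisted.internet import defer, reactor

URL = "http://bandscore.ielts.org/"
# hypothetical form field; inspect the page's <form> for the real input names
payloads = [{'CountryName': country} for country in Countries_List]

def fetch(fields):
    # getPage forwards method, postdata and headers to HTTPClientFactory
    return getPage(URL,
                   method='POST',
                   postdata=urllib.urlencode(fields),
                   headers={'Content-Type': 'application/x-www-form-urlencoded'})

waiting = [fetch(fields) for fields in payloads]
defer.gatherResults(waiting).addCallback(finish)
reactor.run()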
There is a howto-style document for Agent (linked from the overall web howto documentation index). It gives examples of sending a request with a body (i.e., see the FileBodyProducer example).
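For completeness, here is a minimal sketch of the Agent route, again assuming a urlencoded form body; FileBodyProducer wraps any file-like object, and readBody (available in recent Twisted releases) collects the response body into a string:

from StringIO import StringIO
import urllib

from twisted.internet import reactor
from twisted.web.client import Agent, FileBodyProducer, readBody
from twisted.web.http_headers import Headers

def post_form(url, fields):
    # encode the form fields and wrap them in a body producer
    body = FileBodyProducer(StringIO(urllib.urlencode(fields)))
    agent = Agent(reactor)
    d = agent.request(
        'POST', url,
        Headers({'Content-Type': ['application/x-www-form-urlencoded']}),
        body)
    # readBody drains the response into a single string for BeautifulSoup
    d.addCallback(readBody)
    return d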
Related
BeautifulSoup unable to get data from mds-data-table on Morningstar
I'm trying to get the dividend information from Morningstar. The following code works for scraping info from Finviz, but the dividend information is not the same as my broker platform.

symbol = 'bxs'
morningstar_url = 'https://www.morningstar.com/stocks/xnys/' + symbol + '/dividends'
http = urllib3.PoolManager()
response = http.request('GET', morningstar_url)
soup = BeautifulSoup(response.data, 'lxml')
html = list(soup.children)[1]
[type(item) for item in list(soup.children)]

def display_elements(L, show=0):
    test = list(L.children)
    if(show):
        for i in range(len(test)):
            print(i)
            print(test[i])
            print()
    return(test)

test = display_elements(html, 1)

I have no issue printing out the elements, but I cannot find the element that houses the information such as the "Total Yield %" of 2.8%. How do I get inside the mds-data-table to extract the information?
Great question! I've actually worked on this specifically, but years ago. Morningstar will only load the tables after running a script, to prevent this exact type of scraping behavior. If you view source immediately on load, you won't be able to see any HTML. What you're going to want to do is find the JavaScript code that is loading the elements, and hook up bs4 to use that. You'll have to poke around the files, but somewhere deep in those js files, you'll find a dynamic URL. It'll be hidden, but it'll be in there somewhere. I'll go look at some of my old code and see if I can find something that helps. So here's an edited sample of what used to work for me:

from urllib.request import urlopen
import logging
import time

exchange = 'NYSE'
ticker = 'V'

if exchange == 'NYSE':
    exchange_code = "XNYS"
elif exchange in ["NasdaqNM", "NASDAQ"]:
    exchange_code = "XNAS"
else:
    # the original snippet bailed out of a larger function here
    logging.info("Unknown Exchange Code for {}".format(ticker))
    raise SystemExit(1)

time_now = int(time.time())
time_delay = int(time.time() + 150)

morningstar_raw = urlopen(f'http://financials.morningstar.com/ajax/ReportProcess4HtmlAjax.html?&t={exchange_code}:{ticker}&region=usa&culture=en-US&cur=USD&reportType=is&period=12&dataType=A&order=asc&columnYear=5&rounding=3&view=raw&r=354589&callback=jsonp{time_now}&_={time_delay}')
print(morningstar_raw)

Granted this solution is from a file last edited sometime in 2018, and they may have changed up their scripting, but you can find this and much more on my github project wxStocks
Python web scraping too slow in execution: how to optimize for speed
I have written a web scraping program in Python. It is working correctly but takes 1.5 hours to execute. I am not sure how to optimize the code. The logic of the code: every country has many ASNs with the client name. I am getting all the ASN links (e.g. https://ipinfo.io/AS2856) and using Beautiful Soup and regex to get the data as JSON. The output is just a simple JSON.

import urllib.request
import bs4
import re
import json

url = 'https://ipinfo.io/countries'
SITE = 'https://ipinfo.io'

def url_to_soup(url):
    # bgp.he.net is filtered by user-agent
    req = urllib.request.Request(url)
    opener = urllib.request.build_opener()
    html = opener.open(req)
    soup = bs4.BeautifulSoup(html, "html.parser")
    return soup

def find_pages(page):
    pages = []
    for link in page.find_all(href=re.compile('/countries/')):
        pages.append(link.get('href'))
    return pages

def get_each_sites(links):
    mappings = {}
    print("Scraping Pages for ASN Data...")
    for link in links:
        country_page = url_to_soup(SITE + link)
        current_country = link.split('/')[2]
        for row in country_page.find_all('tr'):
            columns = row.find_all('td')
            if len(columns) > 0:
                #print(columns)
                current_asn = re.findall(r'\d+', columns[0].string)[0]
                print(SITE + '/AS' + current_asn)
                s = str(url_to_soup(SITE + '/AS' + current_asn))
                asn_code, name = re.search(r'(?P<ASN_CODE>AS\d+) (?P<NAME>[\w.\s(&)]+)', s).groups()
                #print(asn_code[2:])
                #print(name)
                country = re.search(r'.*href="/countries.*">(?P<COUNTRY>.*)?</a>', s).group("COUNTRY")
                print(country)
                registry = re.search(r'Registry.*?pb-md-1">(?P<REGISTRY>.*?)</p>', s, re.S).group("REGISTRY").strip()
                #print(registry)
                # the re.S flag makes '.' match any character at all, including a newline
                mtch = re.search(r'IP Addresses.*?pb-md-1">(?P<IP>.*?)</p>', s, re.S)
                if mtch:
                    ip = mtch.group("IP").strip()
                    #print(ip)
                mappings[asn_code[2:]] = {'Country': country, 'Name': name, 'Registry': registry, 'num_ip_addresses': ip}
    return mappings

main_page = url_to_soup(url)
country_links = find_pages(main_page)
#print(country_links)
asn_mappings = get_each_sites(country_links)
print(asn_mappings)

The output is as expected, but super slow.
You probably don't want to speed your scraper up. When you scrape a site, or connect in a way that humans don't (24/7), it's good practice to keep requests to a minimum so that:

- you blend into the background noise;
- you don't (D)DoS the website in the hope of finishing faster, while racking up costs for the website owner.

What you can do, however, is get the AS names and numbers from this website (see this SO answer), and recover the IPs using PyASN, as sketched below.
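A minimal sketch of the PyASN suggestion, assuming a routing-table snapshot has already been downloaded and converted with pyasn's bundled pyasn_util_download.py / pyasn_util_convert.py scripts; the .dat filename below is a placeholder:

import pyasn

# load a local IP-to-ASN snapshot built offline (filename is a placeholder)
asndb = pyasn.pyasn('ipasn_db.dat')

# prefixes announced by AS2856, the AS used as an example in the question
prefixes = asndb.get_as_prefixes(2856)
print(len(prefixes), "prefixes announced by AS2856")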
I think what you need is to run the scraping in multiple processes. This can be done using the Python multiprocessing package, since multi-threaded programs don't achieve true parallelism in Python because of the GIL (Global Interpreter Lock). There are plenty of examples of how to do this; here are some: Multiprocessing Spider, Speed up Beautiful soup scraper. A minimal sketch follows below.
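A minimal sketch of that multiprocessing idea, under the assumption that the per-ASN work is factored into a single function; fetch_asn() here is a hypothetical stand-in that only grabs each page's title rather than the question's full regex parsing:

import urllib.request
from multiprocessing import Pool

import bs4

def fetch_asn(asn_url):
    # fetch one ASN page and return (url, <title> text) as a stand-in for the real parsing
    html = urllib.request.urlopen(asn_url).read()
    soup = bs4.BeautifulSoup(html, "html.parser")
    return asn_url, soup.title.string if soup.title else None

if __name__ == '__main__':
    # a couple of ASN pages as in the question; in practice this list would come
    # from the per-country pages collected by find_pages()/get_each_sites()
    asn_urls = ['https://ipinfo.io/AS2856', 'https://ipinfo.io/AS15169']
    with Pool(processes=4) as pool:
        results = dict(pool.map(fetch_asn, asn_urls))
    print(results)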
Why does requests.post get no response from the Clustal Omega service?
import requests

MSA_request = """>G1
MGCTLSAEDKAAVERSKMIDRNLREDGEKAAREVKLLLL
>G2
MGCTVSAEDKAAAERSKMIDKNLREDGEKAAREVKLLLL
>G3
MGCTLSAEERAALERSKAIEKNLKEDGISAAKDVKLLLL"""

q = {"stype": "protein", "sequence": MSA_request, "outfmt": "clustal"}
r = requests.post("http://www.ebi.ac.uk/Tools/msa/clustalo/", data=q)

This is my script. I send this request to the website, but the result looks like I did nothing; the web service didn't receive my request. This method used to work fine with other websites. Maybe this page has a pop-up window asking for cookie agreement?
The form on the page you are referring to has a separate URL, namely http://www.ebi.ac.uk/Tools/services/web_clustalo/toolform.ebi; you can verify this with a DOM inspector in your browser. So in order to proceed with requests, you need to access the right page:

r = requests.post("http://www.ebi.ac.uk/Tools/services/web_clustalo/toolform.ebi", data=q)

This will submit a job with your input data; it doesn't return the result directly. To check the results, it's necessary to extract the job ID from the previous response and then generate another request (with no data) to http://www.ebi.ac.uk/Tools/services/web_clustalo/toolresult.ebi?jobId=... However, you should definitely check whether this programmatic access is compatible with the TOS of that website. Here is an example:

from lxml import html
import requests
import sys
import time

MSA_request = """>G1
MGCTLSAEDKAAVERSKMIDRNLREDGEKAAREVKLLLL
>G2
MGCTVSAEDKAAAERSKMIDKNLREDGEKAAREVKLLLL
>G3
MGCTLSAEERAALERSKAIEKNLKEDGISAAKDVKLLLL"""

q = {"stype": "protein", "sequence": MSA_request, "outfmt": "clustal"}
r = requests.post("http://www.ebi.ac.uk/Tools/services/web_clustalo/toolform.ebi", data=q)
tree = html.fromstring(r.text)
title = tree.xpath('//title/text()')[0]

# check the status and get the job id
status, job_id = map(lambda s: s.strip(), title.split(':', 1))
if status != "Job running":
    sys.exit(1)

# it might take some time for the job to finish
time.sleep(10)

# download the results
r = requests.get("http://www.ebi.ac.uk/Tools/services/web_clustalo/toolresult.ebi?jobId=%s" % (job_id))

# prints the full response
#print(r.text)

# isolate the alignment block
tree = html.fromstring(r.text)
alignment = tree.xpath('//pre[@id="alignmentContent"]/text()')[0]
print(alignment)
How to retrieve Google URLs from a search query
So I'm trying to create a Python script that will take a search term or query, then search Google for that term. It should then return 5 URLs from the results of that search. I spent many hours trying to get PyGoogle to work, but later found out that Google no longer supports the SOAP API for search, nor do they provide new license keys. In a nutshell, PyGoogle is pretty much dead at this point. So my question here is: what would be the most compact/simple way of doing this? I would like to do this entirely in Python. Thanks for any help.
Use BeautifulSoup and requests to get the links from the Google search results:

import requests
from bs4 import BeautifulSoup

keyword = "Facebook"  # enter your keyword here
search = "https://www.google.co.uk/search?sclient=psy-ab&client=ubuntu&hs=k5b&channel=fs&biw=1366&bih=648&noj=1&q=" + keyword
r = requests.get(search)

soup = BeautifulSoup(r.text, "html.parser")
container = soup.find('div', {'id': 'search'})
url = container.find("cite").text
print(url)
What issues are you having with pygoogle? I know it is no longer supported, but I've utilized that project on many occasions and it would work fine for the menial task you have described. Your question did make me curious though, so I went to Google and typed "python google search". Bam, found this repository. Installed with pip and within 5 minutes of browsing their documentation got what you asked:

import google

for url in google.search("red sox", num=5, stop=1):
    print(url)

Maybe try a little harder next time, ok?
Here is the link to the xgoogle library, which does the same. I tried something similar to get the top 10 links, which also counts occurrences of the target word in the linked pages. I have added the code snippet for your reference:

import operator
import urllib
# This line imports the GoogleSearch and SearchError classes from xgoogle/search.py
from xgoogle.search import GoogleSearch, SearchError

my_dict = {}
print "Enter the word to be searched : "
# read user input
yourword = raw_input()

try:
    # This will perform a Google search on our keyword
    gs = GoogleSearch(yourword)
    gs.results_per_page = 80
    # get the Google search results
    results = gs.get_results()
    source = ''
    # loop through all results to get each link and its content
    for res in results:
        #print res.url.encode('utf8')
        # this will give the url
        parsedurl = res.url.encode("utf8")
        myurl = urllib.urlopen(parsedurl)
        # the line above reads the url content; in the line below we parse the content of that web page
        source = myurl.read()
        # This line counts the occurrences of the entered keyword in our web page
        count = source.count(yourword)
        # We store our result in a dictionary: for each url, its word count
        my_dict[parsedurl] = count
except SearchError, e:
    print "Search failed: %s" % e

print my_dict
#sorted_x = sorted(my_dict, key=lambda x: x[1])
for key in sorted(my_dict, key=my_dict.get, reverse=True):
    print(key, my_dict[key])
Why am I getting an httplib2.RedirectLimit error?
I have a script that takes a URL and returns the value of the page's <title> tag. After a few hundred or so runs, I always get the same error:

  File "/home/edmundspenser/Dropbox/projects/myfiles/titlegrab.py", line 202, in get_title
    status, response = http.request(pageurl)
  File "/usr/lib/python2.7/dist-packages/httplib2/__init__.py", line 1390, in _request
    raise RedirectLimit("Redirected more times than rediection_limit allows.", response, content)
httplib2.RedirectLimit: Redirected more times than rediection_limit allows.

My function looks like:

def get_title(pageurl):
    http = httplib2.Http()
    status, response = http.request(pageurl)
    x = BeautifulSoup(response, parseOnlyThese=SoupStrainer('title'))
    x = str(x)
    y = x[7:-8]
    z = y.split('-')[0]
    return z

Pretty straightforward. I used try and except and time.sleep(1) to give it time to maybe get unstuck if that was the issue, but so far nothing has worked. And I don't want to pass on it. Maybe the website is rate-limiting me?

edit: As of right now the script doesn't work at all; it runs into said error with the first request. I have a json file of over 80,000 URLs of www.wikiart.org painting pages. For each one I run my function to get the title. So:

print repr(get_title('http://www.wikiart.org/en/vincent-van-gogh/van-gogh-s-chair-1889'))

returns "Van Gogh's Chair"
Try using the Requests library. On my end, there seems to be no rate-limiting that I've seen. I was able to retrieve 13 titles in 21.6s. See below:

Code:

import requests as rq
from bs4 import BeautifulSoup as bsoup

def get_title(url):
    r = rq.get(url)
    soup = bsoup(r.content)
    title = soup.find_all("title")[0].get_text()
    print title.split(" - ")[0]

def main():
    urls = [
        "http://www.wikiart.org/en/henri-rousseau/tiger-in-a-tropical-storm-surprised-1891",
        "http://www.wikiart.org/en/edgar-degas/the-green-dancer-1879",
        "http://www.wikiart.org/en/claude-monet/dandelions",
        "http://www.wikiart.org/en/albrecht-durer/the-little-owl-1506",
        "http://www.wikiart.org/en/gustav-klimt/farmhouse-with-birch-trees-1903",
        "http://www.wikiart.org/en/jean-michel-basquiat/boxer",
        "http://www.wikiart.org/en/fernand-leger/three-women-1921",
        "http://www.wikiart.org/en/alphonse-mucha/flower-1897",
        "http://www.wikiart.org/en/alphonse-mucha/ruby",
        "http://www.wikiart.org/en/georges-braque/musical-instruments-1908",
        "http://www.wikiart.org/en/rene-magritte/the-evening-gown-1954",
        "http://www.wikiart.org/en/m-c-escher/lizard-1",
        "http://www.wikiart.org/en/johannes-vermeer/the-girl-with-a-pearl-earring"
    ]
    for url in urls:
        get_title(url)

if __name__ == "__main__":
    main()

Output:

Tiger in a Tropical Storm (Surprised!)
The Green Dancer
Dandelions
The Little Owl
Farmhouse with Birch Trees
Boxer
Three Women
Flower
Ruby
Musical Instruments
The evening gown
Lizard
The Girl with a Pearl Earring
[Finished in 21.6s]

However, out of personal ethics, I don't recommend doing it like this. With a fast connection, you'll pull data too fast. Allowing the scrape to sleep every 20 pages or so for a few seconds won't hurt.

EDIT: An even faster version, using grequests, which allows asynchronous requests to be made. This pulls the same data above in 2.6s, nearly 10 times faster. Again, limit your scrape speed out of respect for the site.

import grequests as grq
from bs4 import BeautifulSoup as bsoup

def get_title(response):
    soup = bsoup(response.content)
    title = soup.find_all("title")[0].get_text()
    print title.split(" - ")[0]

def main():
    urls = [
        "http://www.wikiart.org/en/henri-rousseau/tiger-in-a-tropical-storm-surprised-1891",
        "http://www.wikiart.org/en/edgar-degas/the-green-dancer-1879",
        "http://www.wikiart.org/en/claude-monet/dandelions",
        "http://www.wikiart.org/en/albrecht-durer/the-little-owl-1506",
        "http://www.wikiart.org/en/gustav-klimt/farmhouse-with-birch-trees-1903",
        "http://www.wikiart.org/en/jean-michel-basquiat/boxer",
        "http://www.wikiart.org/en/fernand-leger/three-women-1921",
        "http://www.wikiart.org/en/alphonse-mucha/flower-1897",
        "http://www.wikiart.org/en/alphonse-mucha/ruby",
        "http://www.wikiart.org/en/georges-braque/musical-instruments-1908",
        "http://www.wikiart.org/en/rene-magritte/the-evening-gown-1954",
        "http://www.wikiart.org/en/m-c-escher/lizard-1",
        "http://www.wikiart.org/en/johannes-vermeer/the-girl-with-a-pearl-earring"
    ]
    rs = (grq.get(u) for u in urls)
    for i in grq.map(rs):
        get_title(i)

if __name__ == "__main__":
    main()
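Alternatively, if you want to keep the httplib2 approach from the question, its Http.request() accepts a redirections argument (the default budget is 5) and the RedirectLimit exception can be caught so a single looping URL doesn't abort the whole run. A minimal sketch, assuming BeautifulSoup 3 as in the question:

import httplib2
from BeautifulSoup import BeautifulSoup, SoupStrainer  # BeautifulSoup 3, as used in the question

def get_title(pageurl):
    http = httplib2.Http()
    try:
        # allow a few more redirects than httplib2's default budget of 5
        status, response = http.request(pageurl, redirections=10)
    except httplib2.RedirectLimit:
        return None  # skip URLs stuck in a redirect loop instead of crashing
    x = BeautifulSoup(response, parseOnlyThese=SoupStrainer('title'))
    return str(x)[7:-8].split('-')[0]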