I have an application that I am trying to load test with Locust. If I know the parameters of a post in advance, I can add them to a post and that works fine:
self.client.post("/Login", {"Username":"user", "Password":"a"})
The application uses a bunch of hidden fields that get sent when the page is posted interactively. The content of these fields is dynamic and assigned by the server at runtime to manage sessions etc. e.g.
<input type="hidden" name="$$submitid" value="view:xid1:xid2:xid143:xid358">
Is there a way I can pick these up to add to my post data? I know the names of the hidden inputs.
You can write a function to extract this data using PyQuery; you just need to call it before sending the POST request. If you want to build up a set of data, you can call it in the on_start function, store the results in a list, and then use them in your tasks. See the example below:
from locust import HttpLocust, TaskSet, task
from pyquery import PyQuery

class UserBehaviour(TaskSet):
    def get_data(self, url, locator, key):
        data = []
        request = self.client.get(url)
        pq = PyQuery(request.content)
        link_elements = pq(locator)
        for link in link_elements:
            # keep only attribute values that are not absolute http(s) links
            if key in link.attrib and "http" not in link.attrib[key]:
                data.append(link.attrib[key])
        return data

    @task
    def test_get_thing(self):
        data_ = self.get_data("/url/to/send/request", "#review-ul > li > div > a", "href")
        self.client.post("url", data=data_)
I am making a website that tracks population statistics. The site needs to update about every 5 seconds with the latest information.
Here is the relevant code for displaying the pandas df on the page (in a file titled "home.html"):
{% block content %}
  <h1>Population Tracker</h1>
  {% for index, label, data in pop_df.itertuples %}
    <div class="row__data">
      <p>{{ label }}: {{ data }}</p>
    </div>
  {% endfor %}
{% endblock content %}
Here is the code for my scraper (in a separate file called "scraper.py")
import requests
import pandas as pd
from bs4 import BeautifulSoup

class Scraper():
    def __init__(self):
        self.URL = "https://countrymeters.info/en/Japan"

    def scrape(self):
        "Scrapes the population data once"
        page = requests.get(self.URL)
        soup = BeautifulSoup(page.content, 'html.parser')
        data_div = soup.find_all('div', class_="data_div")[0]
        table = data_div.findAll("table")[0]
        tr = table.findAll('tr')
        labels = []
        numbers = []
        for n, i in enumerate(tr):
            number = i.findAll('td', {"class": "counter"})[0].text  # numbers
            label = i.findAll('td', {"class": "data_name"})[0].text  # labels
            labels.append(label)
            numbers.append(number)
        pop_df = pd.DataFrame(
            {
                'Labels': labels,
                'Data': numbers
            }
        )
        return pop_df
In my views.py file, here is what I have done:
from django.shortcuts import render
from bsoup_tracker.scraper import Scraper

scraper = Scraper()
df = scraper.scrape()

def home(request):
    context = {
        'pop_df': df
    }
    return render(request, 'tracker/home.html', context)
Basically, I would like to re-render my home.html page every 5 seconds with the latest data, without the user having to refresh. I have tried to look elsewhere and see that AJAX could help; however, I do not know where to begin.
Instead of using Django to render the page, create an API, call it every 5 seconds, and after getting the results refresh the HTML content using JavaScript.
If you need more information please let me know.
AJAX stands for "asynchronous JavaScript and XML", so, as you thought, that would be the way to go if you need to fetch data from your backend and refresh the interface.
The base component to do so is the XMLHttpRequest object in vanilla JavaScript. However, I strongly advise using a library like jQuery; to me it's really easier to use. With vanilla JS, jQuery or any other library you choose, you can modify the DOM to expose the data you got from your backend. The major drawback is that you will probably end up with not-so-clean code which will get harder and harder to maintain.
Nowadays the most common solution would be to use djangorestframework (not mandatory, you can also use Django's JsonResponse) to create an API, along with a JavaScript framework like React or Vue.js to build your interface from the API's data. That way you will have a lot more control over your interface.
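For instance, a minimal sketch of the JsonResponse route, reusing the Scraper from the question (the view name and the URL wiring are assumptions):

from django.http import JsonResponse
from bsoup_tracker.scraper import Scraper

def population_api(request):
    # re-scrape on every call so the client always receives fresh numbers
    pop_df = Scraper().scrape()
    data = dict(zip(pop_df['Labels'], pop_df['Data']))
    return JsonResponse(data)

On the page, a small script can then fetch this endpoint every 5 seconds (for example with setInterval) and rewrite the rows in place, so only the data changes rather than the whole page.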
Finally, if you need some sort of live website (pulling data and refreshing the interface every 5 seconds seems like a poor design pattern to me), you should use websockets in the frontend and ASGI in the backend (instead of WSGI). Django Channels is a nice package to do so, but just Google "django websockets" and you will find a lot of documentation.
import requests
MSA_request=""">G1
MGCTLSAEDKAAVERSKMIDRNLREDGEKAAREVKLLLL
>G2
MGCTVSAEDKAAAERSKMIDKNLREDGEKAAREVKLLLL
>G3
MGCTLSAEERAALERSKAIEKNLKEDGISAAKDVKLLLL"""
q={"stype":"protein","sequence":MSA_request,"outfmt":"clustal"}
r=requests.post("http://www.ebi.ac.uk/Tools/msa/clustalo/",data=q)
This is my script. I send this request to the website, but the result looks as if I did nothing; the web service doesn't seem to receive my request. This method has worked fine with other websites. Maybe it is because this page has a pop-up window asking for cookie agreement?
The form on the page you are referring to has a separate URL, namely
http://www.ebi.ac.uk/Tools/services/web_clustalo/toolform.ebi
You can verify this with a DOM inspector in your browser.
So in order to proceed with requests, you need to access the right page
r=requests.post("http://www.ebi.ac.uk/Tools/services/web_clustalo/toolform.ebi",data=q)
this will submit a job with your input data, it doesn't return the result directly. To check the results, it's necessary to extract the job ID from the previous response and then generate another request (with no data) to
http://www.ebi.ac.uk/Tools/services/web_clustalo/toolresult.ebi?jobId=...
However, you should definitely check whether this programmatic access is compatible with the TOS of that website...
Here is an example:
from lxml import html
import requests
import sys
import time
MSA_request=""">G1
MGCTLSAEDKAAVERSKMIDRNLREDGEKAAREVKLLLL
>G2
MGCTVSAEDKAAAERSKMIDKNLREDGEKAAREVKLLLL
>G3
MGCTLSAEERAALERSKAIEKNLKEDGISAAKDVKLLLL"""
q={"stype":"protein","sequence":MSA_request,"outfmt":"clustal"}
r = requests.post("http://www.ebi.ac.uk/Tools/services/web_clustalo/toolform.ebi",data = q)
tree = html.fromstring(r.text)
title = tree.xpath('//title/text()')[0]
#check the status and get the job id
status, job_id = map(lambda s: s.strip(), title.split(':', 1))
if status != "Job running":
    sys.exit(1)
#it might take some time for the job to finish
time.sleep(10)
#download the results
r = requests.get("http://www.ebi.ac.uk/Tools/services/web_clustalo/toolresult.ebi?jobId=%s" % (job_id))
#prints the full response
#print(r.text)
#isolate the alignment block
tree = html.fromstring(r.text)
alignment = tree.xpath('//pre[@id="alignmentContent"]/text()')[0]
print(alignment)
I'm using pipelines to cache the documents from Scrapy crawls into a database, so that I can reparse them if I change the item parsing logic without having to hit the server again.
What's the best way to have Scrapy process from the cache instead of trying to perform a normal crawl?
I like Scrapy's support for CSS and XPath selectors, otherwise I would just hit the database separately with an lxml parser.
For a time, I wasn't caching the document at all and was using Scrapy in the normal fashion - parsing the items on the fly - but I've found that changing the item logic requires a time- and resource-intensive recrawl. Instead, I'm now caching the document body along with the item parse, and I want the option to have Scrapy iterate through those documents from a database instead of crawling the target URL.
How do I go about modifying Scrapy to give me the option to pass it a set of documents and then parse them individually as if it had just pulled them down from the web?
I think a custom Downloader Middleware is a good way to go. The idea is to have this middleware return the source code directly from the database and not let Scrapy make any HTTP requests.
Sample implementation (not tested and definitely needs error-handling):
import re
import MySQLdb

from scrapy.http import HtmlResponse
from scrapy.exceptions import IgnoreRequest
from scrapy import log
from scrapy.conf import settings

class CustomDownloaderMiddleware(object):
    def __init__(self, *args, **kwargs):
        super(CustomDownloaderMiddleware, self).__init__(*args, **kwargs)
        self.connection = MySQLdb.connect(**settings.DATABASE)
        self.cursor = self.connection.cursor()

    def process_request(self, request, spider):
        # extract the product id from the url
        product_id = re.search(r"(\d+)$", request.url).group(1)

        # get the cached source code from the database by product id
        self.cursor.execute("""
            SELECT source_code
            FROM products
            WHERE product_id = %s
        """, (product_id,))
        source_code = self.cursor.fetchone()[0]

        # build an HTTP response instance without actually hitting the web-site;
        # HtmlResponse keeps CSS/XPath selectors working in the spider
        return HtmlResponse(url=request.url, body=source_code, encoding="utf-8")
And don't forget to activate the middleware.
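For example, a minimal sketch of the activation in settings.py (the module path and the priority value are assumptions; adjust them to your project layout):

# settings.py
DOWNLOADER_MIDDLEWARES = {
    'myproject.middlewares.CustomDownloaderMiddleware': 543,
}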
I am running a series of Selenium functional tests of a Django site for acceptance testing purposes. I notice that when I run these and exceptions occur, I get back an entire page (e.g. an HTTP 500 status).
I am running acceptance testing using a simple loop and storing the output HTML to a DB using the Django ORM:
from django.http import HttpResponse

def my_functional_tests(request):
    import requests
    from mytests.models import Entry
    for i in range(3):
        p1 = { ....... }
        r1 = requests.post('http://127.0.0.1:8000/testfunction1/', data=p1)
        ..............
        entry = Entry(output1=r1.text, output2=r2.text, output3=r3.text)
        entry.save()
    return HttpResponse("completed")
My Model is defined as (where the outputs are the HTML results of 3 functional tests ):
from django.db import models

class Entry(models.Model):
    output1 = models.CharField(max_length=240)
    output2 = models.CharField(max_length=240)
    output3 = models.CharField(max_length=240)
When I get an error, the resulting (approximately 65K) web page causes an exception on saving and breaks the testing. I want to get as much info as possible, so I could increase max_length to, let's say, 70,000 to store the entire page, but is there a more concise way to capture and store the relevant data, including the specific errors, in the DB?
If you did this with Django's testing client, you could get more concise information--but by using requests, you're really hitting your page as a web browser would, so the full page is what you get (but 65K for a 500 Error page? Wow).
Could you embed in the error page an HTML comment with a marker and concise explanation?
<html>
<h1>Error</h1>
... 64k of stuff follows ...
<!-- ERR:"info about error" -->
</html>
That way, you could parse the results for that error code and store just that.
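As a small illustration (not from the original answer), the marker could be pulled out with a regular expression before saving, falling back to a truncated page if no marker is present:

import re

def extract_error_marker(html_text):
    # pull out just the ERR:"..." comment if the page contains one
    match = re.search(r'<!--\s*ERR:"(.*?)"\s*-->', html_text, re.DOTALL)
    return match.group(1) if match else html_text[:240]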
Of course, you'll want to make sure you don't put anything confidential in that error message or, if you do, that you emit it only when in DEBUG mode or when the request comes from localhost, or logged in as staff, or whatever other security constraint would work.
Slightly prettier would be to write a piece of middleware that emits the error-info as an HTTP Header; then your page could stay the same and you could look at the response headers for your error info.
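A minimal sketch of that middleware idea, assuming old-style Django middleware (the header name X-Error-Info and the exact wording are assumptions):

from django.http import HttpResponseServerError

class ErrorHeaderMiddleware(object):
    def process_exception(self, request, exception):
        # return a short 500 response and put a concise summary in a header
        response = HttpResponseServerError("Internal error")
        response["X-Error-Info"] = "%s: %s" % (type(exception).__name__, exception)
        return response

On the client side, requests exposes it via r1.headers.get("X-Error-Info"), which is small enough to store in the existing CharField columns.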
I'm using Python to scrape data from a number of web pages that have simple HTML input forms, like the 'Username:' form at the bottom of this page:
http://www.w3schools.com/html/html_forms.asp (this is just a simple example to illustrate the problem)
Firefox Inspect Element indicates this form field has the following HTML structure:
<form name="input0" target="_blank" action="html_form_action.asp" method="get">
Username:
<input name="user" size="20" type="text"></input>
<input value="Submit" type="submit"></input>
</form>
All I want to do is fill out this form and get the resulting page:
http://www.w3schools.com/html/html_form_action.asp?user=ThisIsMyUserName
Which is what is produced in my browser by entering 'ThisIsMyUserName' in the 'Username' field and pressing 'Submit'. However, every method that I have tried (details below) returns the contents of the original page containing the unaltered form without any indication the form data I submitted was recognized, i.e. I get the content from the first link above in response to my request, when I expected to receive the content of the second link.
I suspect the problem has to do with action="html_form_action.asp" in the form above, or perhaps some kind of hidden field I'm missing (I don't know what to look for - I'm new to form submission). Any suggestions?
HERE IS WHAT I'VE TRIED SO FAR:
Using urllib.request in Python 3:
import urllib.request
import urllib.parse
# Create dict of form values
example_data = urllib.parse.urlencode({'user': 'ThisIsMyUserName'})
# Encode dict
example_data = example_data.encode('utf-8')
# Create request
example_url = 'http://www.w3schools.com/html/html_forms.asp'
request = urllib.request.Request(example_url, data=example_data)
# Create opener and install
my_url_opener = urllib.request.build_opener() # no handlers
urllib.request.install_opener(my_url_opener)
# Open the page and read content
web_page = urllib.request.urlopen(request)
content = web_page.read()
# Save content to file
my_html_file = open('my_html_file.html', 'wb')
my_html_file.write(content)
But what is returned to me and saved in 'my_html_file.html' is the original page containing the unaltered form, without any indication that my form data was recognized, i.e. I get this page in response: http://www.w3schools.com/html/html_forms.asp ...which is the same thing I would have expected if I made this request without the data parameter at all (which would change the request from a POST to a GET).
Naturally the first thing I did was check whether my request was being constructed properly:
# Just double-checking the request is set up correctly
print("GET or POST?", request.get_method())
print("DATA:", request.data)
print("HEADERS:", request.header_items())
Which produces the following output:
GET or POST? POST
DATA: b'user=ThisIsMyUserName'
HEADERS: [('Content-length', '21'), ('Content-type', 'application/x-www-form-urlencoded'), ('User-agent', 'Python-urllib/3.3'), ('Host', 'www.w3schools.com')]
So it appears the POST request has been structured correctly. After re-reading the documentation and unsuccessfully searching the web for an answer to this problem, I moved on to a different tool: the requests module. I attempted to perform the same task:
import requests
example_url = 'http://www.w3schools.com/html/html_forms.asp'
data_to_send = {'user': 'ThisIsMyUserName'}
response = requests.post(example_url, params=data_to_send)
contents = response.content
And I get the same exact result. At this point I'm thinking maybe this is a Python 3 issue. So I fire up my trusty Python 2.7 and try the following:
import urllib, urllib2
data = urllib.urlencode({'user' : 'ThisIsMyUserName'})
resp = urllib2.urlopen('http://www.w3schools.com/html/html_forms.asp', data)
content = resp.read()
And I get the same result again! For thoroughness I figured I'd attempt to achieve the same result by encoding the dictionary values into the URL and attempting a GET request:
# Using Python 3
# Construct the url for the GET request
example_url = 'http://www.w3schools.com/html/html_forms.asp'
form_values = {'user': 'ThisIsMyUserName'}
example_data = urllib.parse.urlencode(form_values)
final_url = example_url + '?' + example_data
print(final_url)
This spits out the following value for final_url:
http://www.w3schools.com/html/html_forms.asp?user=ThisIsMyUserName
I plug this into my browser and I see that this page is exactly the same as the original page, which is exactly what my program is downloading.
I've also tried adding additional headers and cookie support to no avail.
I've tried everything I can think of. Any idea what could be going wrong?
The form states an action and a method; you are ignoring both. The method states the form uses GET, not POST, and the action tells you to send the form data to html_form_action.asp.
The action attribute acts like any other URL specifier in an HTML page; unless it starts with a scheme (so with http://..., https://..., etc.) it is relative to the current base URL of the page.
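As a small illustration (not from the original answer), Python's standard library resolves the relative action the same way a browser does:

from urllib.parse import urljoin

# resolve the form's relative action against the page it was served from
print(urljoin('http://www.w3schools.com/html/html_forms.asp',
              'html_form_action.asp'))
# -> http://www.w3schools.com/html/html_form_action.asp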
The GET HTTP method adds the URL-encoded form parameters to the target URL with a question mark:
import urllib.request
import urllib.parse
# Create dict of form values
example_data = urllib.parse.urlencode({'user': 'ThisIsMyUserName'})
# Create request
example_url = 'http://www.w3schools.com/html/html_form_action.asp'
get_url = example_url + '?' + example_data
# Open the page and read content
web_page = urllib.request.urlopen(get_url)
print(web_page.read().decode(web_page.info().get_param('charset', 'utf8')))
or, using requests:
import requests
example_url = 'http://www.w3schools.com/html/html_form_action.asp'
data_to_send = {'user': 'ThisIsMyUserName'}
response = requests.get(example_url, params=data_to_send)
contents = response.text
print(contents)
In both examples I also decoded the response to Unicode text (something requests makes easier for me with the response.text attribute).