Simple POST using urllib with Python 3.3 - python

I'm participating in a hash-breaking contest and I'm trying to automate posting strings to an html form and getting the hash score back. So far I've managed to get SOMETHING posted to the url, but its not the exact string I'm expecting, and thus the value returned for the hash is way off from the one obtained by just typing in the string manually.
import urllib.parse, urllib.request
url = "http://almamater.xkcd.com/?edu=una.edu"
data = "test".encode("ascii")
header = {"Content-Type":"application/octet-stream"}
req = urllib.request.Request(url, data, header)
f = urllib.request.urlopen(req)
print(f.read())
#parse f to pull out hash
I obtain the following hash from the site:
0fff9563bb3279289227ac77d319b6fff8d7e9f09da1247b72a0a265cd6d2a62645ad547ed8193db48cff847c06494a03f55666d3b47eb4c20456c9373c86297d630d5578ebd34cb40991578f9f52b18003efa35d3da6553ff35db91b81ab890bec1b189b7f52cb2a783ebb7d823d725b0b4a71f6824e88f68f982eefc6d19c6
This differs considerably from what I expected, which is what you get if you type in "test" (no quotes) into the form:
e21091dbb0d61bc93db4d1f278a04fe1a51165fb7262c7da31f886ae09ff3e04c41483c500db2792c59742958d8f7f39fe4f4f2cdc7940b7b25e3289b89d344e06f76305b9de525933b5df5dae2a37388f82cf76374fe363587acfb49b9d2c8fc131ef4a32c762be083b07330989b298d60e312f56a6b8a4c0f53c9b59864fb7
Obviously the code isn't doing what I'm expecting it to do. Any tips?

When you submit your form data, it also includes the field name, so when you submit "test" the data submitted actually looks like "hashable=test". Try changing your data like this:
data = "hashable=test".encode("ascii")
or alternatively:
data = urllib.parse.urlencode({'hashable': 'test'})

Related

How do I automatically change a part of a url to query a website a set number of times in python?

I have very basic knowledge of python, so sorry if my question sounds dumb.
I need to query a website for a personal project I am doing, but I need to query it 500 times, and each time I need to change 1 specific part of the url, then take the data and upload it to gsheets.
(The () signifies what part of the url I need to change)
'https://www.alphavantage.co/query?function=BALANCE_SHEET&symbol=(symbol)&apikey=apikey'
I thought about using while and format {} to do it, but I was unsure how to change the string each time, bar writing out the names for variables by hand (defeating the whole purpose of this).
I already have a list of the symbols I need to use, but I don't know how to input them
Example of how I get 1 piece of data
import requests
url = 'https://www.alphavantage.co/query?function=BALANCE_SHEET&symbol=MMM&apikey=demo'
r = requests.get(url)
data = r.json()
Example of what I'd like to change it to
import requests
url = 'https://www.alphavantage.co/query?function=BALANCE_SHEET&symbol=AOS&apikey=demo'
r = requests.get(url)
data = r.json()
#then change it to
import requests
url = 'https://www.alphavantage.co/query?function=BALANCE_SHEET&symbol=ABT&apikey=demo'
r = requests.get(url)
data = r.json()
so on and so forth, 500 times.
You might combine .format with for loop, consider following simple example
symbols = ["abc","xyz","123"]
for s in symbols:
url = 'https://www.example.com?symbol={}'.format(s)
print(url)
output
https://www.example.com?symbol=abc
https://www.example.com?symbol=xyz
https://www.example.com?symbol=123
You might also elect to use any other way of formatting, e.g. f-string (requires python3.6 or newer) in which case code would be
symbols = ["abc","xyz","123"]
for s in symbols:
url = f'https://www.example.com?symbol={s}'
print(url)
Alternatively you might params optional argument of requests.get function as follows
import requests
symbols = ["abc","xyz","123"]
for s in symbols:
r = requests.get('https://www.example.com', params={'symbol':s})
print(r.url)
output
https://www.example.com/?symbol=abc
https://www.example.com/?symbol=xyz
https://www.example.com/?symbol=123

Parsing Json in python 3, get email from API

I'm trying to do a little code that gets the emails (and other things in the future) from an API. But I'm getting "TypeError: list indices must be integers or slices, not str" and I don't know what to do about it. I've been looking at other questions here but I still don't get it. I might be a bit slow when it comes to this.
I've also been watching some tutorials on the tube, and done the same as them, but still getting different errors. I run Python 3.5.
Here is my code:
from urllib.request import urlopen
import json, re
# Opens the url for the API
url = 'https://jsonplaceholder.typicode.com/posts/1/comments'
r = urlopen(url)
# This should put the response from API in a Dict
result= r.read().decode('utf-8')
data = json.loads(result)
#This shuld get all the names from the the Dict
for name in data['name']: #TypeError here.
print(name)
I know that I could regex the text and get the result that I want.
Code for that:
from urllib.request import urlopen
import re
url = 'https://jsonplaceholder.typicode.com/posts/1/comments'
r = urlopen(url)
result = r.read().decode('utf-8')
f = re.findall('"email": "(\w+\S\w+)', result)
print(f)
But that seems like the wrong way to do this.
Can someone please help me understand what I'm doing wrong here?
data is a list of dicts, that's why you are getting TypeError while iterating on it.
The way to go is something like this:
for item in data: # item is {"name": "foo", "email": "foo#mail..."}
print(item['name'])
print(item['email'])
#PiAreSquared's comment is correct, just a bit more explanation here:
from urllib.request import urlopen
import json, re
# Opens the url for the API
url = 'https://jsonplaceholder.typicode.com/posts/1/comments'
r = urlopen(url)
# This should put the response from API in a Dict
result= r.read().decode('utf-8')
data = json.loads(result)
# your data is a list of elements
# and each element is a dict object, so you can loop over the data
# to get the dict element, and then access the keys and values as you wish
# see below for some example
for element in data: #TypeError here.
name = element['name']
email = element['email']
# if you want to get all names, you should do
names = [element['name'] for element in data]
# same to get all emails
emails = [email['email'] for email in data]

Can't extract JSON from an http request

I'm having problems getting data from an HTTP response. The format unfortunately comes back with '\n' attached to all the key/value pairs. JSON says it must be a str and not "bytes".
I have tried a number of fixes so my list of includes might look weird/redundant. Any suggestions would be appreciated.
#!/usr/bin/env python3
import urllib.request
from urllib.request import urlopen
import json
import requests
url = "http://finance.google.com/finance/info?client=ig&q=NASDAQ,AAPL"
response = urlopen(url)
content = response.read()
print(content)
data = json.loads(content)
info = data[0]
print(info)
#got this far - planning to extract "id:" "22144"
When it comes to making requests in Python, I personally like to use the requests library. I find it easier to use.
import json
import requests
r = requests.get('http://finance.google.com/finance/info?client=ig&q=NASDAQ,AAPL')
json_obj = json.loads(r.text[4:])
print(json_obj[0].get('id'))
The above solution prints: 22144
The response data had a couple unnecessary characters at the head, which is why I am only loading the relevant (json) portion of the response: r.text[4:]. This is the reason why you couldn't load it as json initially.
Bytes object has method decode() which converts bytes to string. Checking the response in the browser, seems there are some extra characters at the beginning of the string that needs to be removed (a line feed character, followed by two slashes: '\n//'). To skip the first three characters from the string returned by the decode() method we add [3:] after the method call.
data = json.loads(content.decode()[3:])
print(data[0]['id'])
The output is exactly what you expect:
22144
JSON says it must be a str and not "bytes".
Your content is "bytes", and you can do this as below.
data = json.loads(content.decode())

Autofill simple web form and retrieve result

I have a colleague having the task submitting gene sequences of Hepatitis C viruses from patient samples into a request form of a specific website which then identifies mutations which provides information about potential drug resistance.
This is very cumbersome and takes days.
My thought is to automate this with a Python script using urllib2 (I cannot use mechanize, I have to develop on MAC OS and for reasons I not understand neither Python setup.py install nor pip mechanize install work - so I am bound to urllib2).
My first try was to access the respective website and first submit a sample gene sequence. (On the original website you simple paste the sequence into an entry field named "or paste in" and then press "go".)
On the next page, you will get the result and I want to read out the mutations via regular expressions.
My first try:
import url lib
import urllib2
url = 'http://hcv.geno2pheno.org/index.php'
form_data = {'or paste in:': 'CTTCACGGAGGCTATGACGAGGTACTCCGCTCCCCCCGGGGACCCCCCCCAACCAGAATACGACTTGGAGCTCATAACATCGTGCTCCTCTAACGTGTCAGTCGCCCACGACGGCGCTGGAAAAAGGGTCTACTACCTTACCCGTGACCCTACAACCCCCCTCGCAAGAGCTGCGTGGGAGACAGCAAGACACACTCCAGTCAATTCCTGGCTAGGCAACATAATCATGTTTGCCCCCACATTGTGGGCGAGAATGATACTGATGACCCACTTCTTCAGTGTCCTCATCGCCAGGGATCAACTTGAACAGGCCCTTGATTGCGAAATCTACGGAGCCTGCTACTCCATTCAACCACTGGACCTACCTCCAATCATTCAAAGACTCCATGGCCTTAGCGCATTTTCACTCCACAGTTACTCTCCAGGTGAAATCAATAGGGTGGCCGCATGCCTCAGGAAACTTGGGGTCCCGCCCTTGCGAGCTTGGAGACACCGGGCCCGGAGCGTCCGCGCTAAGCTTCTGTCCAGAGGAGGCAGGGCTGCCATATGTGGCAAGTACCTCTTCAATTGGGCAGTAAGAACAAAGCTCAAACTCACTCCAATAGCGGCCGCTGGCCAGCTGGACTTGTCCGGCTGGTTCACGGCTGGCTACAGCGGGGGAGACATTTATCACAGCGTGTCTC'}
params = urllib.urlencode(form_data)
response = urllib2.urlopen(url, params)
data = response.read()
print data
What I get from "data" is the source code from http://hcv.geno2pheno.org/index.php and not from the following result page.
Therefore, I have two questions:
1) How can I be sure that my sequence was pasted into the entry field "or paste in:" properly?
2) How do I access the source code of the result page so I can apply regular expressions?
There are a couple of things going wrong here. First, you need more parameters in your form_data dict. Just because you only manually fill in one field doesn't mean that's the only parameter the server needs to complete your request. I've included a form_data dict that worked for me below. The main key you're concerned with is 'v3seq'. This is the sequence you want to "paste in".
Then, when you're requesting the page, you need to use a Request object and read the response of that request. Looks like this:
import urllib
import urllib2
url = 'http://hcv.geno2pheno.org/index.php'
form_data = {
'v3seq': 'CTTCACGGAGGCTATGACGAGGTACTCCGCTCCCCCCGGGGACCCCCCCCAACCAGAATACGACTTGGAGCTCATAACATCGTGCTCCTCTAACGTGTCAGTCGCCCACGACGGCGCTGGAAAAAGGGTCTACTACCTTACCCGTGACCCTACAACCCCCCTCGCAAGAGCTGCGTGGGAGACAGCAAGACACACTCCAGTCAATTCCTGGCTAGGCAACATAATCATGTTTGCCCCCACATTGTGGGCGAGAATGATACTGATGACCCACTTCTTCAGTGTCCTCATCGCCAGGGATCAACTTGAACAGGCCCTTGATTGCGAAATCTACGGAGCCTGCTACTCCATTCAACCACTGGACCTACCTCCAATCATTCAAAGACTCCATGGCCTTAGCGCATTTTCACTCCACAGTTACTCTCCAGGTGAAATCAATAGGGTGGCCGCATGCCTCAGGAAACTTGGGGTCCCGCCCTTGCGAGCTTGGAGACACCGGGCCCGGAGCGTCCGCGCTAAGCTTCTGTCCAGAGGAGGCAGGGCTGCCATATGTGGCAAGTACCTCTTCAATTGGGCAGTAAGAACAAAGCTCAAACTCACTCCAATAGCGGCCGCTGGCCAGCTGGACTTGTCCGGCTGGTTCACGGCTGGCTACAGCGGGGGAGACATTTATCACAGCGTGTCTC',
'H77Switch': '1',
'ignore_sgtSwitch': '1',
'alignwidth': '3',
'action': '1',
'go': 'Go',
'viewResults': '1',
'viewResSec': 'Prediction'
}
data = urllib.urlencode(form_data)
req = urllib2.Request(url, data)
response = urllib2.urlopen(req)
html_data = response.read()
You can then scrape the data from the response and apply your regular expressions. If you're able to get your pip working, I would also suggest taking a look at BeautifulSoup - it's an excellent library for scraping data from html.

Python 2&3: both urllib & requests POST data mysteriously disappears

I'm using Python to scrape data from a number of web pages that have simple HTML input forms, like the 'Username:' form at the bottom of this page:
http://www.w3schools.com/html/html_forms.asp (this is just a simple example to illustrate the problem)
Firefox Inspect Element indicates this form field has the following HTML structure:
<form name="input0" target="_blank" action="html_form_action.asp" method="get">
Username:
<input name="user" size="20" type="text"></input>
<input value="Submit" type="submit"></input>
</form>
All I want to do is fill out this form and get the resulting page:
http://www.w3schools.com/html/html_form_action.asp?user=ThisIsMyUserName
Which is what is produced in my browser by entering 'ThisIsMyUserName' in the 'Username' field and pressing 'Submit'. However, every method that I have tried (details below) returns the contents of the original page containing the unaltered form without any indication the form data I submitted was recognized, i.e. I get the content from the first link above in response to my request, when I expected to receive the content of the second link.
I suspect the problem has to do with action="html_form_action.asp" in the form above, or perhaps some kind of hidden field I'm missing (I don't know what to look for - I'm new to form submission). Any suggestions?
HERE IS WHAT I'VE TRIED SO FAR:
Using urllib.requests in Python 3:
import urllib.request
import urllib.parse
# Create dict of form values
example_data = urllib.parse.urlencode({'user': 'ThisIsMyUserName'})
# Encode dict
example_data = example_data.encode('utf-8')
# Create request
example_url = 'http://www.w3schools.com/html/html_forms.asp'
request = urllib.request.Request(example_url, data=example_data)
# Create opener and install
my_url_opener = urllib.request.build_opener() # no handlers
urllib.request.install_opener(my_url_opener)
# Open the page and read content
web_page = urllib.request.urlopen(request)
content = web_page.read()
# Save content to file
my_html_file = open('my_html_file.html', 'wb')
my_html_file.write(content)
But what is returned to me and saved in 'my_html_file.html' is the original page containing
the unaltered form without any indication that my form data was recognized, i.e. I get this page in response: qqqhttp://www.w3schools.com/html/html_forms.asp
...which is the same thing I would have expected if I made this request without the
data parameter at all (which would change the request from a POST to a GET).
Naturally the first thing I did was check whether my request was being constructed properly:
# Just double-checking the request is set up correctly
print("GET or POST?", request.get_method())
print("DATA:", request.data)
print("HEADERS:", request.header_items())
Which produces the following output:
GET or POST? POST
DATA: b'user=ThisIsMyUserName'
HEADERS: [('Content-length', '21'), ('Content-type', 'application/x-www-form-urlencoded'), ('User-agent', 'Python-urllib/3.3'), ('Host', 'www.w3schools.com')]
So it appears the POST request has been structured correctly. After re-reading the
documentation and unsuccessfuly searching the web for an answer to this problem, I
moved on to a different tool: the requests module. I attempted to perform the same task:
import requests
example_url = 'http://www.w3schools.com/html/html_forms.asp'
data_to_send = {'user': 'ThisIsMyUserName'}
response = requests.post(example_url, params=data_to_send)
contents = response.content
And I get the same exact result. At this point I'm thinking maybe this is a Python 3
issue. So I fire up my trusty Python 2.7 and try the following:
import urllib, urllib2
data = urllib.urlencode({'user' : 'ThisIsMyUserName'})
resp = urllib2.urlopen('http://www.w3schools.com/html/html_forms.asp', data)
content = resp.read()
And I get the same result again! For thoroughness I figured I'd attempt to achieve the
same result by encoding the dictionary values into the url and attempting a GET request:
# Using Python 3
# Construct the url for the GET request
example_url = 'http://www.w3schools.com/html/html_forms.asp'
form_values = {'user': 'ThisIsMyUserName'}
example_data = urllib.parse.urlencode(form_values)
final_url = example_url + '?' + example_data
print(final_url)
This spits out the following value for final_url:
qqqhttp://www.w3schools.com/html/html_forms.asp?user=ThisIsMyUserName
I plug this into my browser and I see that this page is exactly the same as
the original page, which is exactly what my program is downloading.
I've also tried adding additional headers and cookie support to no avail.
I've tried everything I can think of. Any idea what could be going wrong?
The form states an action and a method; you are ignoring both. The method states the form uses GET, not POST, and the action tells you to send the form data to html_form_action.asp.
The action attribute acts like any other URL specifier in an HTML page; unless it starts with a scheme (so with http://..., https://..., etc.) it is relative to the current base URL of the page.
The GET HTTP method adds the URL-encoded form parameters to the target URL with a question mark:
import urllib.request
import urllib.parse
# Create dict of form values
example_data = urllib.parse.urlencode({'user': 'ThisIsMyUserName'})
# Create request
example_url = 'http://www.w3schools.com/html/html_form_action.asp'
get_url = example_url + '?' + example_data
# Open the page and read content
web_page = urllib.request.urlopen(get_url)
print(web_page.read().decode(web_page.info().get_param('charset', 'utf8')))
or, using requests:
import requests
example_url = 'http://www.w3schools.com/html/html_form_action.asp'
data_to_send = {'user': 'ThisIsMyUserName'}
response = requests.get(example_url, params=data_to_send)
contents = response.text
print(contents)
In both examples I also decoded the response to Unicode text (something requests makes easier for me with the response.text attribute).

Categories

Resources