Parse response in Python

When I send some data to a host:
r = urllib2.Request(url, data = data, headers = headers)
page = urllib2.urlopen(r)
soup = BeautifulSoup(page.read(), fromEncoding="cp-1251")
print page.read()
I get something like this:
[{"command":"settings","settings":{"basePath":"\/","ajaxPageState":{"theme":"spsr","theme_token":"kRHUhchUVpxAMYL8Y8IoyYIcX0cPrUstziAi8gSmMYk","css":[]},"ajax":{"edit-submit":{"callback":"spsr_calculator_form_ajax","wrapper":"calculator_form","method":"replaceWith","event":"mousedown","keypress":true,"url":"\/ru\/system\/ajax","submit":{"_triggering_element_name":"submit"}}}},"merge":true},{"command":"insert","method":null,"selector":null,"data":"\u003cdiv id=\"calculator_form\"\u003e\u003cform action=\"\/ru\/service\/calculator\" method=\"post\" id=\"spsr-calculator-form\" accept-charset=\"UTF-8\"\u003e\u003cdiv\u003e\u003cinput id=\"edit-from-ship-region-id\" type=\"hidden\" name=\"from_ship_region_id\" value=\"\" \/\u003e\n\u003cinput type=\"hidden\" name=\"form_build_id\" value=\"form-0RK_WFli4b2kUDTxpoqsGPp14B_0yf6Fz9x7UK-T3w8\" \/\u003e\n\u003cinput type=\"hidden\" name=\"form_id\" value=\"spsr_calculator_form\" \/\u003e\n\u003c\/div\u003e\n\u003cdiv class=\"bg_p\"\u003e \n\u0421\u0435\u0439\u0447\u0430\u0441 \u0412\u044b... bla bla bla
but I want to have something like this:
<html><h1>bla bla bla</h1></html>
How can I do it?

The answer you are getting is very likely encoded in JSON. If so, using BeautifulSoup doesn't make any sense (it is an HTML/XML parser); if you have JSON data you need a JSON parser. Calling page.read() twice doesn't make sense either, since it won't return anything sane after the first call.
Rewriting your request part we get:
r = urllib2.Request(url, data = data, headers = headers)
page = urllib2.urlopen(r)
data = page.read()
Now, instead of an HTML parser, we need to use a JSON parser. This can be done with the json library (in the standard library since Python 2.6):
import json
decoded_data = json.loads(data)
Now, just locate which part of the decoded data you want to extract. Considering your example, and given that you want to print out the section with "bla bla bla", you can write:
result = unicode(decoded_data[1][u'data'])
For debugging try:
print result
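Putting the pieces together: decode the JSON first, then hand only the embedded markup to BeautifulSoup. A minimal sketch, assuming the layout of the example response above, where the second command's 'data' key carries the HTML (url, data and headers as in your snippet):
import json
import urllib2
from BeautifulSoup import BeautifulSoup

r = urllib2.Request(url, data=data, headers=headers)
page = urllib2.urlopen(r)
body = page.read()                      # read the body exactly once
decoded = json.loads(body)              # the response is JSON, not HTML
html_fragment = decoded[1][u'data']     # the "insert" command carries the markup
soup = BeautifulSoup(html_fragment)     # only now is BeautifulSoup appropriate
print soup.prettify()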

Related

How do I automatically change a part of a url to query a website a set number of times in python?

I have very basic knowledge of Python, so sorry if my question sounds dumb.
I need to query a website for a personal project I am doing, but I need to query it 500 times, and each time I need to change one specific part of the URL, then take the data and upload it to gsheets.
(The () signifies what part of the URL I need to change)
'https://www.alphavantage.co/query?function=BALANCE_SHEET&symbol=(symbol)&apikey=apikey'
I thought about using while and format {} to do it, but I was unsure how to change the string each time, bar writing out the variable names by hand (defeating the whole purpose of this).
I already have a list of the symbols I need to use, but I don't know how to input them.
Example of how I get 1 piece of data:
import requests
url = 'https://www.alphavantage.co/query?function=BALANCE_SHEET&symbol=MMM&apikey=demo'
r = requests.get(url)
data = r.json()
Example of what I'd like to change it to:
import requests
url = 'https://www.alphavantage.co/query?function=BALANCE_SHEET&symbol=AOS&apikey=demo'
r = requests.get(url)
data = r.json()
#then change it to
import requests
url = 'https://www.alphavantage.co/query?function=BALANCE_SHEET&symbol=ABT&apikey=demo'
r = requests.get(url)
data = r.json()
so on and so forth, 500 times.
You might combine .format with a for loop; consider the following simple example:
symbols = ["abc","xyz","123"]
for s in symbols:
    url = 'https://www.example.com?symbol={}'.format(s)
    print(url)
output
https://www.example.com?symbol=abc
https://www.example.com?symbol=xyz
https://www.example.com?symbol=123
You might also elect to use any other way of formatting, e.g. an f-string (requires Python 3.6 or newer), in which case the code would be
symbols = ["abc","xyz","123"]
for s in symbols:
    url = f'https://www.example.com?symbol={s}'
    print(url)
Alternatively, you might use the optional params argument of the requests.get function, as follows
import requests
symbols = ["abc","xyz","123"]
for s in symbols:
    r = requests.get('https://www.example.com', params={'symbol': s})
    print(r.url)
output
https://www.example.com/?symbol=abc
https://www.example.com/?symbol=xyz
https://www.example.com/?symbol=123
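Applied to your URL, the same pattern becomes a single loop over your symbol list. A minimal sketch, assuming the symbols are already in a Python list; 'demo' stands in for your real API key, and the gsheets upload is left out:
import requests

symbols = ["MMM", "AOS", "ABT"]   # your full list of ~500 symbols goes here
results = {}
for s in symbols:
    r = requests.get(
        'https://www.alphavantage.co/query',
        params={'function': 'BALANCE_SHEET', 'symbol': s, 'apikey': 'demo'},
    )
    results[s] = r.json()         # parsed JSON, keyed by symbol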

Response 404, but URL is reachable from python

I have a web crawler, but currently a 404 error occurs when calling requests.get(url) from the requests module, even though the URL is reachable.
base_url = "https://www.blogger.com/profile/"
site = base_url + blogs_to_visit.pop().rsplit('/', 1)[-1]
r = requests.get(site)
soup = BeautifulSoup(r.content, "html.parser")
# Printing some values for debugging
>>> print site
https://www.blogger.com/profile/01785989747304686024
>>> print r
<Response [404]>
However, if I hardcode the string site for the requests module as the exact same string, the response is 202.
site = "https://www.blogger.com/profile/01785989747304686024"
# Printing some values for debugging
>>> print site
https://www.blogger.com/profile/01785989747304686024
>>> print r
<Response [202]>
What just struck me is that there looks to be a hidden newline after printing site the first time; might that be what's causing the problem?
The URLs to visit were earlier stored in a file with:
for link in soup.select("h2 a[href]"):
    blogs.write(link.get("href") + "\n")
and fetched with
with open("foo") as p:
    return p.readlines()
The question is then: what would be a better way of writing them to the file? If I don't separate them with "\n", for example, all the URLs are glued together as one.
In reference to Getting rid of \n when using .readlines(), perhaps use:
with open("foo") as p:
    return p.read().splitlines()
You can use:
r = requests.get(site.strip('\n'))
instead of:
r = requests.get(site)
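Either way, the point is that no stray "\n" should ever reach requests.get. A minimal sketch combining both suggestions (the file name "foo" and the variable names follow the question and are illustrative):
import requests

base_url = "https://www.blogger.com/profile/"

with open("foo") as p:
    blogs_to_visit = p.read().splitlines()    # lines come back newline-free

for link in blogs_to_visit:
    site = base_url + link.rsplit('/', 1)[-1].strip()   # .strip() as a safety net
    r = requests.get(site)
    print site, r.status_code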

Decoding Google Speech API response in python

I'm trying to use the Google Speech API in Python. I load a .flac file like this:
url = "https://www.google.com/speech-api/v1/recognize?xjerr=1&client=chromium&lang=en-US"
audio = open('temp_voice.flac','rb').read()
headers = {'Content-Type': 'audio/x-flac; rate=44100', 'User-Agent':'Mozilla/5.0'}
req = urllib2.Request(url, data=audio, headers=headers)
resp = urllib2.urlopen(req)
system("rm temp_voice.wav; rm temp_voice.flac")
print resp.read()
Output:
{"status":0,"id":"","hypotheses":[{"utterance":"Today is Wednesday","confidence":0.75135982}]}
Can someone please teach me how I can extract and save the text "Today is Wednesday" as a variable and print it?
You can use json.loads to convert the JSON data to a dict, like this:
data = '{"status":0,"id":"","hypotheses":[{"utterance":"Today is Wednesday","confidence":0.75135982}]}'
import json
data = json.loads(data)
print data["hypotheses"][0]["utterance"]
If the response is coming in as a string then you can just eval it to a dictionary (for safety it is preferable to use literal_eval from the ast library instead):
>>> d=eval('{"status":0,"id":"","hypotheses":[{"utterance":"Today is Wednesday","confidence":0.75135982}]}')
>>> d
{'status': 0, 'hypotheses': [{'confidence': 0.75135982, 'utterance': 'Today is Wednesday'}], 'id': ''}
>>> h=d.get('hypotheses')
>>> h
[{'confidence': 0.75135982, 'utterance': 'Today is Wednesday'}]
>>> for i in h:
... print i.get('utterance')
...
Today is Wednesday
Of course, if it is already a dictionary then you do not need to evaluate it at all; try using print type(response), where response is the result you are getting.
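A minimal sketch of the safer ast.literal_eval route mentioned above:
import ast

raw = '{"status":0,"id":"","hypotheses":[{"utterance":"Today is Wednesday","confidence":0.75135982}]}'
d = ast.literal_eval(raw)                # parses Python literals only; never executes code
print d['hypotheses'][0]['utterance']    # Today is Wednesday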
Retrieving the output is a bit more complicated than it looks. resp is a file-like object, not a string; the payload itself, if you copy it out manually, is a dictionary containing a list of dictionaries. Crucially, the body can only be read once: after it has been consumed (for example by a print), assigning resp.read() to a new variable gives you a string of length 0. Therefore the JSON decoding has to be done as soon as the response from the Google API arrives. As follows:
resp = urllib2.urlopen(req)
text = json.loads(resp.read())["hypotheses"][0]["utterance"]
Works like a charm in my case ;)

Simple POST using urllib with Python 3.3

I'm participating in a hash-breaking contest and I'm trying to automate posting strings to an HTML form and getting the hash score back. So far I've managed to get SOMETHING posted to the URL, but it's not the exact string I'm expecting, and thus the value returned for the hash is way off from the one obtained by just typing in the string manually.
import urllib.parse, urllib.request
url = "http://almamater.xkcd.com/?edu=una.edu"
data = "test".encode("ascii")
header = {"Content-Type":"application/octet-stream"}
req = urllib.request.Request(url, data, header)
f = urllib.request.urlopen(req)
print(f.read())
#parse f to pull out hash
I obtain the following hash from the site:
0fff9563bb3279289227ac77d319b6fff8d7e9f09da1247b72a0a265cd6d2a62645ad547ed8193db48cff847c06494a03f55666d3b47eb4c20456c9373c86297d630d5578ebd34cb40991578f9f52b18003efa35d3da6553ff35db91b81ab890bec1b189b7f52cb2a783ebb7d823d725b0b4a71f6824e88f68f982eefc6d19c6
This differs considerably from what I expected, which is what you get if you type in "test" (no quotes) into the form:
e21091dbb0d61bc93db4d1f278a04fe1a51165fb7262c7da31f886ae09ff3e04c41483c500db2792c59742958d8f7f39fe4f4f2cdc7940b7b25e3289b89d344e06f76305b9de525933b5df5dae2a37388f82cf76374fe363587acfb49b9d2c8fc131ef4a32c762be083b07330989b298d60e312f56a6b8a4c0f53c9b59864fb7
Obviously the code isn't doing what I'm expecting it to do. Any tips?
When you submit your form data, it also includes the field name, so when you submit "test" the data submitted actually looks like "hashable=test". Try changing your data like this:
data = "hashable=test".encode("ascii")
or alternatively:
data = urllib.parse.urlencode({'hashable': 'test'}).encode('ascii')
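Putting it together, a minimal sketch of the corrected POST; the hashable field name comes from the answer above and is an assumption about the form:
import urllib.parse, urllib.request

url = "http://almamater.xkcd.com/?edu=una.edu"
# urlencode returns a str; Request needs bytes, hence the .encode()
data = urllib.parse.urlencode({'hashable': 'test'}).encode('ascii')
req = urllib.request.Request(url, data)
with urllib.request.urlopen(req) as f:
    print(f.read())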

Parsing XML response of bit.ly

I was trying out the bit.ly API for shortening and got it to work. It returns an XML document to my script. I wanted to extract the url tag but can't seem to parse it properly.
askfor = urllib2.Request(full_url)
response = urllib2.urlopen(askfor)
the_page = response.read()
So the_page contains the xml document. I tried:
from xml.dom.minidom import parse
doc = parse(the_page)
this causes an error. What am I doing wrong?
You don't provide an error message, so I can't be sure this is the only error. But xml.dom.minidom.parse does not take a string. From the docstring for parse:
Parse a file into a DOM by filename or file object.
You should try:
response = urllib2.urlopen(askfor)
doc = parse(response)
since response will behave like a file object. Or you could use the parseString method in minidom instead (and then pass the_page as the argument).
EDIT: to extract the URL, you'll need to do:
url_nodes = doc.getElementsByTagName('url')
url = url_nodes[0]
print url.childNodes[0].data
The result of getElementsByTagName is a list of all matching nodes (just one in this case). url is an Element, as you noticed, which contains a child Text node, which in turn holds the data you need.
from xml.dom.minidom import parseString
doc = parseString(the_page)
See the documentation for xml.dom.minidom.
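An end-to-end sketch combining both answers; full_url is assumed to be defined as in the question:
import urllib2
from xml.dom.minidom import parse

askfor = urllib2.Request(full_url)
response = urllib2.urlopen(askfor)
doc = parse(response)                         # parse the file-like response directly
url_nodes = doc.getElementsByTagName('url')
print url_nodes[0].childNodes[0].data         # the text inside the url element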
