urllib usage in Python 3

I get the following error:
TypeError: POST data should be bytes, an iterable of bytes, or a file object. It cannot be of type str.
when making the call below:
import urllib.request, urllib.parse, urllib.error
import json
chemcalcURL = 'http://www.chemcalc.org/chemcalc/em'
# Define a molecular formula string
mfRange = 'C0-100H0-100N0-10O0-10'
# target mass
mass = 300
# Define the parameters and send them to Chemcalc
# other options (mass tolerance, unsaturation, etc.)
params = {'mfRange': mfRange,'monoisotopicMass': mass}
response = urllib.request.urlopen(chemcalcURL, urllib.parse.urlencode(params))
# Read the output and convert it from JSON into a Python dictionary
jsondata = response.read()
data = json.loads(jsondata)
print(data)

You have to convert your request data to bytes, which involves the bytes() constructor:
response = urllib.request.urlopen(chemcalcURL, bytes(urllib.parse.urlencode(params), encoding="utf-8"))
bytes() must be given an encoding, which for websites is almost always UTF-8.
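For reference, the whole corrected call as a minimal sketch; str.encode("utf-8") does the same job as the bytes(..., encoding="utf-8") form above:
import json
import urllib.parse
import urllib.request
chemcalcURL = 'http://www.chemcalc.org/chemcalc/em'
params = {'mfRange': 'C0-100H0-100N0-10O0-10', 'monoisotopicMass': 300}
# urlencode() returns a str; encode() produces the bytes urlopen() expects as POST data
data = urllib.parse.urlencode(params).encode('utf-8')
response = urllib.request.urlopen(chemcalcURL, data)
print(json.loads(response.read()))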

Related

How to Load Text Data into Data Frame from Response Object

I am attempting to convert a curl request into a GET request to pull some data for work and save it to a local folder under a parameterized file name. One issue is that the data comes back only as text and will not convert to JSON, even after trying multiple methods. Per the response, the content type is "text/tsv; charset=utf-8".
The next issue is that I cannot load the data into a data frame, partially because I am new to Python and do not understand the various methods for doing so, and partially because the formatting makes it more difficult to find an applicable solution. However, I was able to at least break the text into lists by using the splitlines() method. Unfortunately, though, I still cannot load the lists into a data frame. As of the last run, the error message is: "Error: cannot concatenate object of type '<class '_csv.reader'>'; only Series and DataFrame objs are valid."
import requests
import datetime
import petl
import csv
import pandas as pd
import sys
from requests.auth import HTTPBasicAuth
from curlParameters import *
def calculate_year():
    current_year = datetime.datetime.now().year
    return str(current_year)

def file_name():
    name = "CallDetail"
    year = calculate_year()
    file_type = ".csv"
    return name + year + file_type

try:
    response = requests.get(url, params=parameters, auth=HTTPBasicAuth(username, password))
except Exception as e:
    print("Error:" + str(e))
    sys.exit()

if response.status_code == 200:
    raw_data = response.text
    parsed_data = csv.reader(raw_data.splitlines(), delimiter='\t')
    table = pd.DataFrame(columns=[
        'contact_id',
        'master_contact_id',
        'Contact_Code',
        'media_name',
        'contact_name',
        'ani_dialnum',
        'skill_no',
        'skill_name',
        'campaign_no',
        'campaign_name',
        'agent_no',
        'agent_name',
        'team_no',
        'team_name',
        'disposition_code',
        'sla',
        'start_date',
        'start_time',
        'PreQueue',
        'InQueue',
        'Agent_Time',
        'PostQueue',
        'Total_Time',
        'Abandon_Time',
        'Routing_Time',
        'abandon',
        'callback_time',
        'Logged',
        'Hold_Time'])
    try:
        for row in table:
            table.append(parsed_data)
    except Exception as e:
        print("Error:" + str(e))
        sys.exit()
    petl.tocsv(table=table, source=local_source+file_name(), encoding='utf-8', write_header=True)
You're trying to append parsed_data, which is the reader object you iterate over, rather than individual rows. I would recommend reading all the rows out of the response first, then loading them into the DataFrame in one go. This requires a slight restructuring of the code, something like this:
parsed_data = [row for row in csv.reader(raw_data.splitlines(), delimiter='\t')]
table = pd.DataFrame(parsed_data, columns=your_long_column_list)
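Alternatively, pandas can parse the TSV text directly; a minimal sketch, assuming the response body has no header row (if it does, drop names= and let pandas pick up the header):
import io
import pandas as pd
table = pd.read_csv(io.StringIO(raw_data), sep='\t', names=your_long_column_list)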

Python (requests) - incorrect encoding when fetching headers

I am using the requests library (Python 3.9) to get the filename from a URL.[1] For some reason the file name is incorrectly encoded.
I should get "Ogłoszenie_0320.pdf" instead of "OgÅ\x82oszenie_0320.pdf".
My code looks something like this:
import os
import re
import requests
from urllib.parse import urlparse

def getFilenameFromRequest(url: str, headers):
    # Parses from header information
    contentDisposition = headers.get('content-disposition')
    if contentDisposition:
        filename = re.findall('filename=(.+)', contentDisposition)
        print("oooooooooo: " + contentDisposition + " : " + str(filename))
        if len(filename) != 0:
            return filename[0]
    # Parses from url
    parsedUrl = urlparse(url)
    return os.path.basename(parsedUrl.path)

def getFilenameFromUrl(url: str):
    request = requests.head(url)
    headers = request.headers
    return getFilenameFromRequest(url, headers)

getFilenameFromUrl('https://przedszkolekw.bip.gov.pl' +
                   '/fobjects/download/880287/ogloszenie-uzp-nr-613234-pdf.html')
Any idea how to fix it?
I know for standard request I can set encoding directly:
request.encoding = 'utf-8'
But what am I supposed to do with this case?
[1]
https://przedszkolekw.bip.gov.pl/fobjects/download/880287/ogloszenie-uzp-nr-613234-pdf.html
Only characters from the ASCII-based Latin-1 set may be used in header values [RFC]. Here the file name's UTF-8 bytes have been decoded as if they were Latin-1, which produces exactly the mangled string you see:
>>> s = "Ogłoszenie_0320.pdf"
>>> s.encode("utf8").decode("unicode-escape")
'OgÅ\x82oszenie_0320.pdf'
To reverse the process you can do
>>> sx = 'OgÅ\x82oszenie_0320.pdf'
>>> sx.encode("latin-1").decode("utf8")
'Ogłoszenie_0320.pdf'
(updated after conversation in comments)
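Applied to the question's code, the repair can live in a small helper; a sketch with a hypothetical fixFilename function, assuming every mangled name is UTF-8 that was mis-decoded as Latin-1:
def fixFilename(rawName: str) -> str:
    # Reverse the Latin-1 mis-decoding; fall back if the name is already clean
    try:
        return rawName.encode('latin-1').decode('utf-8')
    except (UnicodeEncodeError, UnicodeDecodeError):
        return rawName
getFilenameFromRequest would then return fixFilename(filename[0]) instead of filename[0].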

Decompress Python requests response with zlib

I'm trying to decompress the response from a web request using Python requests and zlib but I'm not able to decompress the response content properly. Here's my code:
import requests
import zlib
URL = "http://" #omitted real url
r = requests.get(URL)
print r.content
data = zlib.decompress(r.content, zlib.MAX_WBITS)
print data
However, I keep getting various errors when changing the wbits parameter.
zlib.error: Error -3 while decompressing data: incorrect header check
zlib.error: Error -3 while decompressing data: invalid stored block lengths
I tried the wbits parameters for deflate, zlib and gzip as noted here: zlib.error: Error -3 while decompressing: incorrect header check. But I still can't get past these errors. I'm trying to do this in Python; I was given a piece of code that does it in Objective-C, but I don't know Objective-C:
#import "GTMNSData+zlib.h"
+ (NSData*) uncompress: (NSData*) data
{
Byte *bytes= (Byte*)[data bytes];
NSInteger length=[data length];
NSMutableData* retdata=[[NSMutableData alloc] initWithCapacity:length*3.5];
NSInteger bSize=0;
NSInteger offSet=0;
while (true) {
offSet+=bSize;
if (offSet>=length) {
break;
}
bSize=bytes[offSet];
bSize+=(bytes[offSet+1]<<8);
bSize+=(bytes[offSet+2]<<16);
bSize+=(bytes[offSet+3]<<24);
offSet+=4;
if ((bSize==0)||(bSize+offSet>length)) {
LogError(#"Invalid");
return data;
}
[retdata appendData:[NSData gtm_dataByInflatingBytes: bytes+offSet length:bSize]];
}
return retdata;
}
The Python requests documentation at
http://docs.python-requests.org/en/master/user/quickstart/#binary-response-content
says:
You can also access the response body as bytes, for non-text requests:
>>> r.content
b'[{"repository":{"open_issues":0,"url":"https://github.com/...
The gzip and deflate transfer-encodings are automatically decoded for you.
If requests understands the encoding, the content should therefore already be uncompressed.
Use r.raw if you need access to the original data in order to apply a different decompression mechanism:
http://docs.python-requests.org/en/master/user/quickstart/#raw-response-content
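A minimal sketch of getting at the undecoded bytes (per the raw-response-content docs above, stream=True is needed so requests does not consume the body first):
import requests
r = requests.get(URL, stream=True)
compressed = r.raw.read()  # the bytes exactly as they came off the wire, before any decoding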
The following is an untested translation of the Objective-C code:
import zlib
import struct

def uncompress(data):
    length = len(data)
    ret = []
    bSize = 0
    offSet = 0
    while True:
        offSet += bSize
        if offSet >= length:
            break
        # "<i" reads the 4-byte little-endian block size; unpack returns a tuple
        bSize = struct.unpack("<i", data[offSet:offSet+4])[0]
        offSet += 4
        if bSize == 0 or bSize + offSet > length:
            print "Invalid"
            return ''.join(ret)
        ret.append(zlib.decompress(data[offSet:offSet+bSize]))
    return ''.join(ret)
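If the response body really is in that custom length-prefixed block format, usage would be a sketch along the lines of:
r = requests.get(URL)
data = uncompress(r.content)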

Assign specific data from a string response to a variable

My code:
#Importing the urllib tool to my program
import urllib.request
#Fetch data from URL
response = urllib.request.urlopen('<URL>')
#Store that response into the variable below
taginfo = response.read()
#Tag info result of search for SSI values
taginforesult = taginfo
#print taginfo
print(taginfo)
The result of the above in the Python shell is correct, as follows:
b'LOCATE00016331: tagid="00016331", taggroupid=LOCATE, tagtype=mantis04A, irlocator=null, motion=false, tamper=false, panic=false, lowbattery=false, locationzone="", gpsid="", lastgpsid="", lastgpsts=null, confidencebyrule={}\r\n(CarrierHQ_channel_A: reader=CarrierHQ, channel="A", ssi=-95)\r\n(CarrierHQ_channel_B: reader=CarrierHQ, channel="B", ssi=-99)\r\n\r\n\r\n'
What I want to know is: how do I select only the ssi=-95 and ssi=-99 values from the response above and assign them to SSI-A and SSI-B variables?
Do I use strip(), findall(), search(), ...?
That's a strange format. But you can easily cut it up to get the parts you want.
ssia = (str(taginfo).split("\\r\\n")[1]
        .strip("()")
        .split(",")[-1]
        .strip()
        .split("=")[1])
assert ssia == '-95'
ssib = (str(taginfo).split("\\r\\n")[2]
        .strip("()")
        .split(",")[-1]
        .strip()
        .split("=")[1])
assert ssib == '-99'
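A regular expression is arguably more robust if the field order ever changes; a minimal sketch, reusing taginfo from the question:
import re
# decode the bytes, then pull out every ssi=<number> value in document order
ssi_a, ssi_b = re.findall(r'ssi=(-?\d+)', taginfo.decode('ascii'))
assert (ssi_a, ssi_b) == ('-95', '-99')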

Fetching language detection from Google api

I have a CSV with keywords in one column and the number of impressions in a second column.
I'd like to provide the keywords in a URL (while looping) and have the Google language API return the language of each keyword.
I have it working manually. If I enter (with the correct api key):
http://ajax.googleapis.com/ajax/services/language/detect?v=1.0&key=myapikey&q=merde
I get:
{"responseData": {"language":"fr","isReliable":false,"confidence":6.213709E-4}, "responseDetails": null, "responseStatus": 200}
which is correct, 'merde' is French.
So far I have this code, but I keep getting server-unreachable errors:
import time
import csv
from operator import itemgetter
import sys
import fileinput
import urllib2
import json
E_OPERATION_ERROR = 1
E_INVALID_PARAMS = 2

#not working
def parse_result(result):
    """Parse a JSONP result string and return a list of terms"""
    # Deserialize JSON to Python objects
    result_object = json.loads(result)
    # Get the rows in the table, then get the second column's value
    # for each row
    return row in result_object

#not working
def retrieve_terms(seedterm):
    print(seedterm)
    """Retrieves and parses data and returns a list of terms"""
    url_template = 'http://ajax.googleapis.com/ajax/services/language/detect?v=1.0&key=myapikey&q=%(seed)s'
    url = url_template % {"seed": seedterm}
    try:
        with urllib2.urlopen(url) as data:
            data = perform_request(seedterm)
            result = data.read()
    except:
        sys.stderr.write('%s\n' % 'Could not request data from server')
        exit(E_OPERATION_ERROR)
    #terms = parse_result(result)
    #print terms
    print result

def main(argv):
    filename = argv[1]
    csvfile = open(filename, 'r')
    csvreader = csv.DictReader(csvfile)
    rows = []
    for row in csvreader:
        rows.append(row)
    sortedrows = sorted(rows, key=itemgetter('impressions'), reverse=True)
    keys = sortedrows[0].keys()
    for item in sortedrows:
        retrieve_terms(item['keywords'])
    try:
        outputfile = open('Output_%s.csv' % (filename), 'w')
    except IOError:
        print("The file is active in another program - close it first!")
        sys.exit()
    dict_writer = csv.DictWriter(outputfile, keys, lineterminator='\n')
    dict_writer.writer.writerow(keys)
    dict_writer.writerows(sortedrows)
    outputfile.close()
    print("File is Done!! Check your folder")

if __name__ == '__main__':
    start_time = time.clock()
    main(sys.argv)
    print("\n")
    print time.clock() - start_time, "seconds for script time"
Any idea how to finish the code so that it will work? Thank you!
Try adding referrer and userip as described in the docs:

An area to pay special attention to relates to correctly identifying yourself in your requests. Applications MUST always include a valid and accurate http referer header in their requests. In addition, we ask, but do not require, that each request contains a valid API Key. By providing a key, your application provides us with a secondary identification mechanism that is useful should we need to contact you in order to correct any problems. Read more about the usefulness of having an API key.

Developers are also encouraged to make use of the userip parameter (see below) to supply the IP address of the end-user on whose behalf you are making the API request. Doing so will help distinguish this legitimate server-side traffic from traffic which doesn't come from an end-user.
Here's an example based on the answer to the question "access to google with python":
#!/usr/bin/python
# -*- coding: utf-8 -*-
import json
import urllib, urllib2
from pprint import pprint

api_key, userip = None, None
query = {'q': 'матрёшка'}
referrer = "https://stackoverflow.com/q/4309599/4279"
if userip:
    query.update(userip=userip)
if api_key:
    query.update(key=api_key)

url = 'http://ajax.googleapis.com/ajax/services/language/detect?v=1.0&%s' % (
    urllib.urlencode(query))
request = urllib2.Request(url, headers=dict(Referer=referrer))
json_data = json.load(urllib2.urlopen(request))
pprint(json_data['responseData'])
Output
{u'confidence': 0.070496580000000003, u'isReliable': False, u'language': u'ru'}
Another issue might be that seedterm is not properly quoted:
if isinstance(seedterm, unicode):
    value = seedterm
else:  # bytes
    value = seedterm.decode(put_encoding_here)
url = 'http://...q=%s' % urllib.quote_plus(value.encode('utf-8'))
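Putting both fixes together, retrieve_terms could be rewritten along these lines; an untested Python 2 sketch, assuming referrer is defined as above, the real API key is substituted in, and the CSV is UTF-8 encoded:
def retrieve_terms(seedterm):
    """Return the language-detection result for one keyword."""
    if isinstance(seedterm, unicode):
        value = seedterm
    else:  # bytes; assuming the CSV is UTF-8
        value = seedterm.decode('utf-8')
    query = urllib.urlencode({'q': value.encode('utf-8'), 'key': 'myapikey'})
    url = 'http://ajax.googleapis.com/ajax/services/language/detect?v=1.0&%s' % query
    request = urllib2.Request(url, headers=dict(Referer=referrer))
    return json.load(urllib2.urlopen(request))['responseData']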
