Dealing with HTTP IncompleteRead Error in Python

I am trying to understand how to handle an http.client.IncompleteRead error in the code below. I handle the error using the idea in this post. I thought it might just be the server limiting the number of times I can access the data, but strangely I sometimes get an HTTP status code of 200 and the code below still ends up returning None. Is this because zipdata = e.partial is not returning anything when the error comes up?
def update(self, ABBRV):
    if self.ABBRV == 'cd':
        try:
            my_url = 'http://www.bankofcanada.ca/stats/results/csv'
            data = urllib.parse.urlencode({"lookupPage": "lookup_yield_curve.php",
                                           "startRange": "1986-01-01",
                                           "searchRange": "all"})
            binary_data = data.encode('utf-8')
            req = urllib.request.Request(my_url, binary_data)
            result = urllib.request.urlopen(req)
            print('status:: {},{}'.format(result.status, my_url))
            zipdata = result.read()
            zipfile = ZipFile(BytesIO(zipdata))
            df = pd.read_csv(zipfile.open(zipfile.namelist()[0]))
            df = pd.melt(df, id_vars=['Date'])
            return df
        # In case of http.client.IncompleteRead Error
        except http.client.IncompleteRead as e:
            zipdata = e.partial
Thank You

Hmm... I've run your code 20 times without getting an incomplete read, so why not just retry when one occurs? On the other hand, if your IP is being blocked, it would make sense that they're not returning you anything. Your code could look something like this:
maxretries = 3
attempt = 0
while attempt < maxretries:
    try:
        # http access code goes in here
        ...
    except http.client.IncompleteRead:
        attempt += 1
    else:
        break
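Note also that in the question's code the except branch only assigns zipdata = e.partial and then falls off the end of the function, which is why the caller sees None. Below is a minimal sketch of how the retry loop could wrap the original download; urlopen and e.partial mirror the question's code, but the helper name fetch_zip, its signature, and the simple retry policy are my own assumptions, not something from this thread:
import http.client
import urllib.parse
import urllib.request

def fetch_zip(url, form_fields, maxretries=3):
    # Hypothetical helper: retry the whole request when an IncompleteRead occurs.
    binary_data = urllib.parse.urlencode(form_fields).encode('utf-8')
    partial = b''
    for attempt in range(maxretries):
        try:
            req = urllib.request.Request(url, binary_data)
            result = urllib.request.urlopen(req)
            return result.read()  # full body read succeeded
        except http.client.IncompleteRead as e:
            # e.partial holds whatever bytes did arrive; keep the last attempt's bytes
            partial = e.partial
    return partial  # only reached if every attempt was incomplete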

Related

Changing variable for parameters for http json get requests

Pretty new to Python, so go easy on me :). The code below works, but I was wondering if there is a way to change the indcode parameter in a loop so I do not have to repeat the requests.get call.
paraD = dict()
paraD["area"] = "123"
paraD["periodtype"] = "2"
paraD["indcode"] = "722"
paraD["$limit"]=1000
#Open URL and get data for business indcode 722
document_1 = requests.get(dataURL, params=paraD)
bizdata_1 = document_1.json()
#Open URL and get data for business indcode 445
paraD["indcode"] = "445"
document_2 = requests.get(dataURL, params=paraD)
bizdata_2 = document_2.json()
#Open URL and get data for business indcode 311
paraD["indcode"] = "311"
document_3 = requests.get(dataURL, params=paraD)
bizdata_3 = document_3.json()
#Combine the three lists
output = bizdata_1 + bizdata_2 + bizdata_3
Since indcode is the only parameter that changes for each request, we will put that in a list and make the web requests inside a loop.
data_url = ""
post_params = dict()
post_params["area"] = "123"
post_params["periodtype"] = "2"
post_params["$limit"]=1000
# The list of indcode values
ind_codes = ["722", "445", "311"]
output = []
# Loop on indcode values
for code in ind_codes:
# Change indcode parameter value in the loop
post_params["indcode"] = code
try:
response = requests.get(data_url, params=post_params)
data1 = response.json()
output.append(data1)
except:
print("web request failed")
# More error handling / retry if required
print(output)
Assuming you're using Python 3.9+, you can combine dictionaries with the | operator. However, be sure you understand exactly what this does; in practice the code to combine the responses may need to be more complex.
When using the requests module it is very important to check the HTTP status code returned by the function (HTTP verb) you're calling.
Here's an approach to the stated problem that may work (depending on how the dictionary merge should be carried out).
from requests import get as GET
from requests.exceptions import HTTPError, Timeout, TooManyRedirects, RequestException
from sys import stderr

# base parameters
params = {'area': '123', 'periodtype': '2', '$limit': 1000}
# the indcodes
indcodes = ('722', '445', '311')

# gets the JSON response (as a Python dictionary)
def getjson(url, params):
    try:
        (r := GET(url, params, timeout=1.0)).raise_for_status()
        return r.json()  # all good
    # if we get any of these exceptions, report to stderr and return an empty dictionary
    except (HTTPError, ConnectionError, Timeout, TooManyRedirects, RequestException) as e:
        print(e, file=stderr)
        return {}
    # any exception here is not associated with requests. Report and raise
    except Exception as f:
        print(f, file=stderr)
        raise

# an empty dictionary
target = {}
# build the target dictionary
# May not produce the desired result depending on how the dictionary merge should be carried out
for indcode in indcodes:
    target |= getjson('https://httpbin.org/json', params | {'indcode': indcode})
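As a small aside on why the final merge is hedged (this illustration is mine, not part of the original answer): with the | operator the right-hand operand wins on duplicate keys, so target |= getjson(...) keeps only the last response's value for any key that appears in more than one response. For example:
base = {'area': '123', 'periodtype': '2', '$limit': 1000}
print(base | {'indcode': '722'})
# {'area': '123', 'periodtype': '2', '$limit': 1000, 'indcode': '722'}
print({'a': 1, 'b': 2} | {'b': 99})
# {'a': 1, 'b': 99}  -> the right-hand value for 'b' wins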

How to Load Text Data into Data Frame from Response Object

I am attempting to convert a curl request into a get-request to pull some data for work and transfer it to a local folder with a parameterized file name. One issue is that the data is only in text format and will not convert to JSON, even after trying multiple methods. Per the response, the data type is "text/tsv; charset=utf-8."
The next issue is that I cannot load the data into a data frame, partially because I am new to Python and do not understand the various methods for doing so, and partially because the formatting makes it more difficult to find an applicable solution. However, I was able to at least break the text into lists by using the splitlines() method. Unfortunately, though, I still cannot load the lists into a data frame. As of the last run, the error message is: "Error: cannot concatenate object of type '<class '_csv.reader'>'; only Series and DataFrame objs are valid."
import requests
import datetime
import petl
import csv
import pandas as pd
import sys
from requests.auth import HTTPBasicAuth
from curlParameters import *

def calculate_year():
    current_year = datetime.datetime.now().year
    return str(current_year)

def file_name():
    name = "CallDetail"
    year = calculate_year()
    file_type = ".csv"
    return name + year + file_type

try:
    response = requests.get(url, params=parameters, auth=HTTPBasicAuth(username, password))
except Exception as e:
    print("Error:" + str(e))
    sys.exit()

if response.status_code == 200:
    raw_data = response.text
    parsed_data = csv.reader(raw_data.splitlines(), delimiter='\t')
    table = pd.DataFrame(columns=[
        'contact_id',
        'master_contact_id',
        'Contact_Code',
        'media_name',
        'contact_name',
        'ani_dialnum',
        'skill_no',
        'skill_name',
        'campaign_no',
        'campaign_name',
        'agent_no',
        'agent_name',
        'team_no',
        'team_name',
        'disposition_code',
        'sla',
        'start_date',
        'start_time',
        'PreQueue',
        'InQueue',
        'Agent_Time',
        'PostQueue',
        'Total_Time',
        'Abandon_Time',
        'Routing_Time',
        'abandon',
        'callback_time',
        'Logged',
        'Hold_Time'])
    try:
        for row in table:
            table.append(parsed_data)
    except Exception as e:
        print("Error:" + str(e))
        sys.exit()
    petl.tocsv(table=table, source=local_source + file_name(), encoding='utf-8', write_header=True)
So, you're trying to append parsed_data, which is the csv.reader object you would normally iterate over, directly onto the DataFrame. I would recommend reading all the rows from the response first and then loading them into the DataFrame in one go. This requires a slight restructuring of the code, something like this:
parsed_data = [row for row in csv.reader(raw_data.splitlines(), delimiter='\t')]
table = pd.DataFrame(parsed_data, columns=your_long_column_list)
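Alternatively (my own suggestion, not part of the original answer), pandas can parse the tab-separated text directly and skip the manual csv.reader step; whether the first line of the response is a header row is an assumption you would need to check:
import io
import pandas as pd

# Parse the tab-separated body straight from the response text.
# header=0 assumes the first line of the feed contains column names; if it does not,
# pass header=None and names=your_long_column_list instead.
table = pd.read_csv(io.StringIO(response.text), sep='\t', header=0)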

pycurl.CurlMulti - Can't get the returned results, only the object memory address

I have been trying to get this example code working https://fragmentsofcode.wordpress.com/2011/01/22/pycurl-curlmulti-example/ for a bit tonight and it has me stumped (not difficult).
m = pycurl.CurlMulti()
for url in urllist:
    response = io.StringIO()
    handle = pycurl.Curl()
    handle.setopt(pycurl.URL, url)
    handle.setopt(pycurl.WRITEDATA, response)
    req = (url, response)
    m.add_handle(handle)
    reqs.append(req)

# Perform multi-request.
# This code copied from pycurl docs, modified to explicitly
# set num_handles before the outer while loop.
SELECT_TIMEOUT = 1.0
num_handles = len(reqs)
while num_handles:
    ret = m.select(SELECT_TIMEOUT)
    if ret == -1:
        continue
    while 1:
        ret, num_handles = m.perform()
        if ret != pycurl.E_CALL_MULTI_PERFORM:
            break

for req in reqs:
    # req[1].getvalue() contains response content
    print(req[1].getvalue())
I have tried the following prints
req[1]
req[1].getvalue()
req[1].getvalue().decode('ISO-8859-1')
The closest I have come is printing something like the following for all fetched URLs:
_io.StringIO object at 0x7f19f931d9d8
I have also tried using io.BytesIO with the same results.
Comments in the code are from the original author; I have changed some minor things that I think are version dependent and removed the related comments.
How can I print the contents of the object at that memory address rather than just the address?
EDIT:
print(req[1].getvalue()) raises the following error
print(req[1].getalue())
AttributeError: '_io.BytesIO' object has no attribute 'getalue
so I squished it into a variable like so
temp = req[1].getvalue()
print(temp)
and it just printed b with empty single quotes, like so:
b''

KeyError Issue when decoding JSON

I am making a Telegram bot using Python 3 on a Raspberry Pi, and for HTTP requests I use the requests library.
I wrote the code that should answer the &start command:
import requests as rq

updateURL = "https://api.telegram.org/bot925438333:AAGEr3pf3c4Fz91sL79mwJ6aGYm-Y6BM7_4/getUpdates"

while True:
    r = rq.post(url=updateURL)
    data = r.json()
    messageArray = data['result']
    lastMsgID = len(messageArray) - 1
    lastMsgData = messageArray[lastMsgID]
    lastMsgSenderID = lastMsgData['message']['from']['id']
    lastMsgUsername = lastMsgData['message']['from']['username']
    lastMsgText = lastMsgData["message"]["text"]
    lastMsgChatType = lastMsgData['message']['chat']['type']
    if lastMsgChatType == "group":
        lastMsgGroupID = lastMsgData['message']['chat']['id']
    if lastMsgText == "&start":
        if lastMsgChatType == "private":
            URL = "https://api.telegram.org/bot925438333:AAGEr3pf3c4Fz91sL79mwJ6aGYm-Y6BM7_4/sendMessage"
            chatText = "Witamy w KozelBot"
            chatID = lastMsgSenderID
            Params = {"chat_id": chatID, "text": chatText}
            rs = rq.get(url=URL, params=Params)
        if lastMsgChatType == "group":
            URL = "https://api.telegram.org/bot925438333:AAGEr3pf3c4Fz91sL79mwJ6aGYm-Y6BM7_4/sendMessage"
            chatText = "Witamy w KozelBot"
            chatID = lastMsgGroupID
            Params = {"chat_id": chatID, "text": chatText}
            rs = rq.get(url=URL, params=Params)
but the code outputs an error:
File "/home/pi/telegramResponse.py", line 16, in
lastMsgText = lastMsgData["message"]["text"]
KeyError: 'text'
I don't know how to solve this problem because this fragment is working fine in my other scripts!
Please help!
The reason was simple!
The last message the program finds doesn't contain any text because it was a new-user notification (a service message). The KeyError occurred because that update simply has no ['text'] field.
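A minimal sketch of how the loop could guard against such service messages (my own illustration; the dict.get fallbacks are an assumption, not something from the original answer):
message = lastMsgData.get('message', {})
# Service messages (e.g. a new member joining) carry no 'text' key,
# so fall back to an empty string instead of raising KeyError.
lastMsgText = message.get('text', '')
if not lastMsgText:
    continue  # inside the while True loop: skip non-text updates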

pool.map list index out of range python

There is about a 70% chance it shows this error:
res=pool.map(feng,urls)
File "c:\Python27\lib\multiprocessing\pool.py", line 251, in map
return self.map_async(func, iterable, chunksize).get()
File "c:\Python27\lib\multiprocessing\pool.py", line 567, in get
raise self._value
IndexError: list index out of range
I don't know why; if there are fewer than 100 items, there is only about a 5% chance of seeing that message. Does anyone have an idea how to improve this?
# coding:utf-8
import multiprocessing
import requests
import bs4
import re
import string

root_url = 'http://www.haoshiwen.org'
#index_url = root_url+'/type.php?c=1'

def xianqin_url():
    f = 0
    h = 0
    x = 0
    y = 0
    b = []
    l = []
    for i in range(1, 64):  # number of pages
        index_url = root_url + '/type.php?c=1' + '&page=' + "%s" % i
        response = requests.get(index_url)
        soup = bs4.BeautifulSoup(response.text, "html.parser")
        x = [a.attrs.get('href') for a in soup.select('div.sons a[href^=/]')]  # collect the div.sons links on each page
        c = len(x)  # c links in total
        j = 0
        for j in range(c):
            url = root_url + x[j]
            us = str(url)
            print "collected %s" % us
            l.append(url)  # pool = multiprocessing.Pool(8)
    return l

def feng(url):
    response = requests.get(url)
    response.encoding = 'utf-8'
    #print response.text
    soup = bs4.BeautifulSoup(response.text, "html.parser")
    #content = soup.select('div.shileft')
    qq = str(soup)
    soupout = re.findall(r"原文(.+?)</div>", qq, re.S)  # the section starting with "原文" (original text) and ending with </div>
    #print soupout[1]
    content = str(soupout[1])
    b = "风"
    cc = content.count(b, 0, len(content))
    return cc

def start_process():
    print 'Starting', multiprocessing.current_process().name

def feng(url):
    response = requests.get(url)
    response.encoding = 'utf-8'
    #print response.text
    soup = bs4.BeautifulSoup(response.text, "html.parser")
    #content = soup.select('div.shileft')
    qq = str(soup)
    soupout = re.findall(r"原文(.+?)</div>", qq, re.S)  # the section starting with "原文" (original text) and ending with </div>
    #print soupout[1]
    content = str(soupout[1])
    b = "风"
    c = "花"
    d = "雪"
    e = "月"
    f = content.count(b, 0, len(content))
    h = content.count(c, 0, len(content))
    x = content.count(d, 0, len(content))
    y = content.count(e, 0, len(content))
    return f, h, x, y

def find(urls):
    r = [0, 0, 0, 0]
    pool = multiprocessing.Pool()
    res = pool.map4(feng, urls)
    for i in range(len(res)):
        r = map(lambda (a, b): a + b, zip(r, res[i]))
    return r

if __name__ == "__main__":
    print "start collecting URLs"
    qurls = xianqin_url()
    print "collected %s links" % len(qurls)
    print "start matching pre-Qin poems"
    find(qurls)
    print '''
    Among %s pre-Qin articles:
    ---------------------------
    风 (wind): %s
    花 (flower): %s
    雪 (snow): %s
    月 (moon): %s
    data source: %s
    ''' % (len(qurls), find(qurls)[0], find(qurls)[1], find(qurls)[2], find(qurls)[3], root_url)
(Stack Overflow would not accept a body containing "pool ma p", so I changed that line to res=pool.map4(feng,urls) above; the real code uses pool.map, as shown in the traceback.)
I'm trying to extract some substrings from this website with multiprocessing.
Indeed, multiprocessing makes it a bit hard to debug, as you don't see where the index out of range error occurred (the error message makes it appear as if it happened internally in the multiprocessing module).
In some cases this line:
content = str(soupout[1])
raises an index out of range error, because soupout is an empty list. If you change it to
if len(soupout) == 0:
    return None
and then remove the None values that were returned by changing
res = pool.map(feng, urls)
into
res = pool.map(feng, urls)
res = [r for r in res if r is not None]
then you can avoid the error. That said, you probably want to find out the root cause of why re.findall returned an empty list. It is certainly a better idea to select the node with BeautifulSoup than with a regex, as matching with bs4 is generally more stable, especially if the website slightly changes its markup (e.g. whitespace, etc.).
Update:
Why is soupout an empty list? I never saw this error message when I didn't use pool.map.
This is probably because you hammer the web server too fast. In a comment you mention that you sometimes get 504 in response.status_code. 504 means Gateway Time-out: the server was acting as a gateway or proxy and did not receive a timely response from the upstream server.
This is because haoshiwen.org seems to be powered by kangle, which is a reverse proxy. The reverse proxy hands every request you send on to the web server behind it, and if you start too many processes at once the poor web server cannot handle the flood. Kangle has a default timeout of 60s, so as soon as it doesn't get an answer back from the web server within 60s it shows the error you posted.
How do you fix that?
You could limit the number of processes: pool = multiprocessing.Pool(2); you'd need to play around with a good number of processes.
At the top of feng(url) you could add a time.sleep(5) so each process waits 5 seconds between requests; here too you'd need to play around with the sleep time. A combined sketch of both suggestions is shown below.
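A minimal combined sketch of both suggestions (my own illustration, not code from the answer: the pool size of 2 and the 5-second sleep are the tuning guesses mentioned above; the guard checks len(soupout) < 2 because the original code indexes soupout[1], and only the 风/wind count is shown to keep it short):
import time
import multiprocessing
import re
import requests
import bs4

def feng(url):
    time.sleep(5)  # throttle: wait between requests (tune as needed)
    response = requests.get(url)
    response.encoding = 'utf-8'
    soup = bs4.BeautifulSoup(response.text, "html.parser")
    soupout = re.findall(r"原文(.+?)</div>", str(soup), re.S)
    if len(soupout) < 2:  # guard against pages where the pattern is not found
        return None
    content = str(soupout[1])
    return content.count("风")

def find(urls):
    pool = multiprocessing.Pool(2)  # small pool so the server is not flooded (tune as needed)
    res = pool.map(feng, urls)
    pool.close()
    pool.join()
    return sum(r for r in res if r is not None)  # drop pages that failed to parse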
