Python Mechanize IncompleteRead Error

I am experimenting with mechanize and re to find the websites which correspond to a list of retail stores.
I have been parsing Bing search results to grab the top result's url. Unfortunately, seemingly independent of the query, at random times I've been getting an httplib.IncompleteRead error. Even though I've got a workaround which follows, I'd like to understand what's happening.
def bingSearch(query): #query is the store's name, i.e. "Bob's Pet Shop"
    while True:
        try:
            bingBrowser.open('http://www.bing.com/search?q="' + query.replace(' ','+') + '"')
            htmlCode = bingBrowser.response().read()
            break
        except httplib.IncompleteRead:
            #Sleep for a little while and try again.
            time.sleep(2)
Other relevant info:
Sometimes, for a single Bing URL, the program will attempt to open and read that URL multiple times before a successful read without an IncompleteRead error.
bingBrowser's headers attribute is set up to look nice.
bingBrowser's robots attribute is set to false.
httplib: incomplete read ... I don't know anything about Apache so I wasn't able to understand the answer to the question, but it may be helpful to you. That said, I doubt that I'm having a similar problem (Why would bing.com be suffering from an Apache error?!)
Edit:
Replaced the query.replace(' ','+') URL construction with urllib.urlencode(dict(q=query)) per J.F. Sebastian's suggestion (sketch below) - no change (I know this wasn't proposed as a solution).
Suffered from an inexplicable urllib2.URLError on bingBrowser.open('http://www.bing.com/search?q="' + query.replace(' ','+') + '"' )
Got an xlwt related "String longer than 65535 characters" error - probably unrelated.
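For reference, the retry logic with the urlencode change looks roughly like this (a sketch; the retry cap and the sleep interval are arbitrary choices of mine):
import time
import urllib
import httplib
import mechanize

bingBrowser = mechanize.Browser()

def bingSearch(query): #query is the store's name, i.e. "Bob's Pet Shop"
    url = 'http://www.bing.com/search?' + urllib.urlencode({'q': '"%s"' % query})
    for attempt in range(5): #arbitrary cap so one bad URL can't loop forever
        try:
            bingBrowser.open(url)
            return bingBrowser.response().read()
        except httplib.IncompleteRead:
            time.sleep(2) #back off a little before retrying
    return None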
Thanks in advance.

Related

How do I make this website recognise my arrays as part of a valid url query?

EDIT:
In a similar vein, when I now try to log into their account with a POST request, the response is none of the errors documented on their site but a generic "JSON exception". Is there any way to debug this, or is an error code 500 essentially impossible to deal with from my end?
I'm well aware this question has been asked before; sadly, none of the proposed answers worked when I tried them. I have an extremely simple Python project using urllib. I've never done web programming in Python before, nor am I even a regular Python user. My friend needs access to content from this site, but their user-friendly front-end is down, and I learned that they have a public API for their content. Not really knowing what I'm doing, but glad to try to help and interested in the challenge, I have very slowly set out.
Note that it is necessary for me to only use standard Python libraries, so that any finished project could easily be emailed to their computer and just work.
The following works completely fine without the "originalLanguage" parameter. The API documents it as an array value, but no matter what I try (comma-separating values, writing "originalLanguage[0]" or "originalLanguage0", or anything else I've seen online), the server responds with an error along the lines of "Array value expected but string detected".
Is there any way for me to get this working? Because it clearly can work, otherwise the API wouldn't document it. Many thanks.
In case it helps, when using "[]" or "<>" or "{}" or any delimiter I could think of, my IDE didn't recognise it as part of the URL.
import urllib.request as request
import urllib.parse as parse

def make_query(url, params):
    url += "?"
    for i, (key, value) in enumerate(params.items()):
        url += key + '=' + value
        if i < len(params) - 1:
            url += '&'
    return url

base = "https://api.mangadex.org/manga"
params = {
    "limit": "50",
    "originalLanguage": "en"
}

url = make_query(base, params)
req = request.Request(url)
response = request.urlopen(req)
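For what it's worth, a common convention for array parameters (and, as far as I can tell, the one the MangaDex docs describe, though I can't verify it here) is to repeat the key with [] appended and let urlencode do the escaping. A sketch, with the bracketed key being my assumption:
import urllib.parse as parse
import urllib.request as request

base = "https://api.mangadex.org/manga"
# A list of (key, value) pairs lets the same key repeat for array parameters.
query = [
    ("limit", "50"),
    ("originalLanguage[]", "en"),  # assumption: the API expects the key[]=value array form
]
url = base + "?" + parse.urlencode(query)
with request.urlopen(url) as response:
    data = response.read()
Note that urlencode percent-encodes the brackets, which may be why pasting a literal [] into the URL by hand didn't look like part of the URL to the IDE.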

Python - Socket Error 10054 - How to prevent terminal from printing error?

Since it is not an execution-halting error, I am not sure what my options are for keeping it from popping up. I do not believe the exact code causing it really matters if there is some universal way to suppress this error line from printing (see my error here).
It is simply using whois to determine whether a domain is registered or not. I was doing a basic test of the top 1,000 English words to see if their .com domains were taken (code here).
Here is my code:
for url in wordlist:
    try:
        domain = whois.whois(url)
        boom.write(("%s,%s,%s\r\n" %
                    (str(number), url, "TAKEN")).encode('UTF-8'))
    except:
        boom.write(("%s,%s,%s\r\n" %
                    (str(number), url, "NOT TAKEN")).encode('UTF-8'))
A bit hard to know for sure without your code, but wrap the section that's generating the error like this:
try:
    # Your error-generating code
except:
    pass
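If you'd rather not hide every possible bug behind a bare except, you can catch only the failures you expect. A sketch, assuming the python-whois package (whose whois.parser.PywhoisError is raised when a lookup finds no match):
import socket
import whois

try:
    domain = whois.whois(url)
    taken = True
except (socket.error, whois.parser.PywhoisError):
    taken = False  # connection reset and "no match" are both treated as not taken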

Proper way to encode http request string

I'm having trouble properly encoding a query string for the Bing image search API.
I got my account key for using the Bing API, and it contains "/" and "+". So when I try the Bing example query like
http://api.bing.net/json.aspx?AppId=MY_APP_ID&Query=xbox site:microsoft.com&Sources=Image
I get a reply saying that my AppId value is invalid:
{"SearchResponse":{"Version":"2.2","Query":{"SearchTerms":"xbox site\u003amicrosoft.com"},"Errors":[{"Code":1002,"Message":"Parameter has invalid value.","Parameter":"SearchRequest.AppId","Value":"*******\u002*****\u002b***","HelpUrl":"http\u003a\u002f\u002fmsdn.microsoft.com\u002fen-us\u002flibrary\u002fdd251042.aspx"}]}}
where *** stands for valid characters of my account key.
I've tried every approach I could think of or find on the web, but still haven't solved it. Here is what I tried:
import requests
url = "http://api.bing.net/json.aspx?AppId=****/***+***&Query=xbox site:microsoft.com&Sources=Image"
r = requests.get(url)
I got an error that the value is invalid "****\u002*** ***"
I tried doing the same thing using urllib2, trying to encode and quote both the whole query and the account key only. The code for separately quoting each part of the request looks like this:
import urllib2
urlStart = u"http://api.bing.net/json.aspx?AppId=%s&Query=xbox"
quotedUrlStart = urllib2.quote(urlStart.encode("utf8"), safe="%/:=&?~#+!$,;'#()*[]")
urlEnd = u" site:microsoft.com&Sources=Image"
quotedUrlEnd = urllib2.quote(urlEnd.encode("utf8"), safe="")
key = u"**/**+**"
quotedKey = urllib2.quote(key.encode("utf8") , safe="%:=&?~#!$,;'#()*[]")
fullUrl = (quotedUrlStart % quotedKey) + quotedUrlEnd
reply = urllib2.urlopen(fullUrl).read()
print reply
I also tried to replace "/" with %2F and "+" with %2B, but the error is the same.
What confuses me here is what I have to quote and what not. I don't yet have a clear understanding of how these things work. I guess I have to encode everything and quote it in different ways: quote slashes in one place and not quote them in another.
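Would letting urlencode escape every parameter value (instead of quoting pieces by hand) be the right direction? A sketch of what I mean, with the key as a placeholder:
import urllib
import urllib2

accountKey = "****/***+***"  # placeholder for the real account key
params = urllib.urlencode({
    "AppId": accountKey,                  # "/" becomes %2F and "+" becomes %2B
    "Query": "xbox site:microsoft.com",   # the space and ":" get escaped too
    "Sources": "Image",
})
reply = urllib2.urlopen("http://api.bing.net/json.aspx?" + params).read()
print reply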
This question addresses the same issue: XCODE Swift Replacing HTTP-Get App-ID with a space
Also there are numerous questions on SO on escaping symbols, but they were unhelpful for me
I appreciate your time guys

Python - Facebook fb_dtsg

On Facebook I want to find fb_dtsg to make a status:
import urllib, urllib2, cookielib
jar = cookielib.CookieJar()
cookie = urllib2.HTTPCookieProcessor(jar)
opener = urllib2.build_opener(cookie)
data = urllib.urlencode({'email':"email",'pass':"password", "Log+In":"Log+In"})
req = urllib2.Request('http://www.facebook.com/login.php')
opener.open(req, data)
opener.open(req, data) #Needs to be opened twice to log on.
req2 = urllib2.Request("http://www.facebook.com/")
page = opener.open(req2).read()
fb_dtsg = page[page.find('name="fb_dtsg"') + 22:page.find('name="fb_dtsg"') + 33] #This just finds the value of "fb_dtsg".
Yes, this does find a value, and one that looks like an fb_dtsg value, but the value changes every time I open the page again, and when I use it to post a status it does not work. When I record what happens in Google Chrome while posting a status normally, I get a working fb_dtsg value that does not change (for a long session) and that does work when I use it to try to post a status. Please, please show me how I can fix this without using the API.
The slice you use to extract fb_dtsg truncates the last character, so change 33 to 34:
fb_dtsg = page[page.find('name="fb_dtsg"') + 22:page.find('name="fb_dtsg"') + 34]
In any case, a more robust way to find fb_dtsg is with re:
re.findall('fb_dtsg.+?value="([^"]+)"',page)
As I answered in one of your earlier posts, the request may also require other hidden form variables.
If this still doesn't work, can you provide the code where you are making the post, including all of the post form data?
BTW, sorry for not looking at all your previous posts with the same content :P
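To spell it out, using the pattern above on the HTML you already fetched would look like this (a sketch; page must be the HTML string returned by read()):
import re

matches = re.findall('fb_dtsg.+?value="([^"]+)"', page)
fb_dtsg = matches[0] if matches else None  # take the first hit, if any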

python lxml xpath AttributeError (NoneType) with correct xpath and usually working

I am trying to migrate a forum to phpbb3 with python/xpath. Although I am pretty new to python and xpath, it is going well. However, I need help with an error.
(The source file has been downloaded and processed with tagsoup.)
Firefox/Firebug show xpath: /html/body/table[5]/tbody/tr[position()>1]/td/a[3]/b
(in my script without tbody)
Here is an abbreviated version of my code:
from lxml import etree

forumfile = "morethread-alte-korken-fruchtweinkeller-89069-6046822-0.html"
XPOSTS = "/html/body/table[5]/tr[position()>1]"
t = etree.parse(forumfile)
allposts = t.xpath(XPOSTS)
XUSER = "td[1]/a[3]/b"
XREG = "td/span"
XTIME = "td[2]/table/tr/td[1]/span"
XTEXT = "td[2]/p"
XSIG = "td[2]/i"
XAVAT = "td/img[last()]"
XPOSTITEL = "/html/body/table[3]/tr/td/table/tr/td/div/h3"
XSUBF = "/html/body/table[3]/tr/td/table/tr/td/div/strong[position()=1]"
for p in allposts:
    unreg = 0
    username = None
    username = p.find(XUSER).text #this is where it goes haywire
When the loop hits user "tompson" / position()=11 at the end of the file, I get
AttributeError: 'NoneType' object has no attribute 'text'
I've tried a lot of try/except/else/finally combinations, but they weren't helpful.
I am getting much more information later in the script such as date of post, date of user registry, the url and attributes of the avatar, the content of the post...
The script works for hundreds of other files/sites of this forum.
This is not an encode/decode problem, and it is not "limited" to the XUSER part. If I hardcode the username, then the date of registration fails; if I skip those, the text of the post (code below) fails...
#text of getpost
text = etree.tostring(p.find(XTEXT),pretty_print=True)
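For what it's worth, guarding the lookup like this keeps the loop running, but it obviously doesn't explain why the node is missing:
user_el = p.find(XUSER)
username = user_el.text if user_el is not None else None  # None when the expected markup is absent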
Now, this whole error would make sense if my XPath were wrong. However, all the other files work, as do the first posts in this file; it is only this one at position()=11.
Is position() incapable of going above 10? I don't think so.
Am I missing something?
Question answered!
I have found the answer...
I must have been very tired when I tried to fix it and came here to ask for help. I did not see something quite obvious...
The way I posted my problem, it was not visible either.
The HTML I downloaded and processed with tagsoup had an additional tag at position 11... it was not visible on the website and it threw off my XPath.
(It is probably crappy HTML generated by the forum, in combination with tagsoup's attempt to make it parseable.)
Out of more than 20,000 files, fewer than 20 are affected; this one just happened to be the first...
Additionally, sometimes the information is in table[4] and other times in table[5]. I did account for this and wrote a function to determine the correct table. Although I tested that function a LOT and thought it was working correctly (hence I did not include it above), it was not.
So I made a better xpath:
'/html/body/table[tr/td[@width="20%"]]/tr[position()>1]'
and, although this is not related, I ran into another problem with unexpected encoding in the HTML file (not UTF-8), which was fixed by adding:
parser = etree.XMLParser(encoding='ISO-8859-15')
t = etree.parse(forumfile, parser)
I am now confident that, after adjusting for the strange additional and repeated tags, my code will work on all files...
Still, I will be looking into lxml.html. As I mentioned in the comment, I have never used it before, but if it is more robust and allows using the files without tagsoup, it might be a better fit and save me extensive try/except statements and loops to fix the few files that break my current script...
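Putting the pieces together, the working skeleton now looks roughly like this (a sketch combining the new XPath and the encoding-aware parser from above):
from lxml import etree

parser = etree.XMLParser(encoding='ISO-8859-15')
t = etree.parse(forumfile, parser)

# Select the posts table by its width attribute instead of a hard-coded index,
# since the content sits in table[4] in some files and in table[5] in others.
XPOSTS = '/html/body/table[tr/td[@width="20%"]]/tr[position()>1]'

for p in t.xpath(XPOSTS):
    user_el = p.find('td[1]/a[3]/b')
    username = user_el.text if user_el is not None else None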
