I am trying to scrape websites and print meta fields. I got a UnicodeEncodeError, but I resolved it by running chcp 65001 in my terminal (I am using Windows). Now it works, but some sites give me strange results: I get "Ä†wiczenia" instead of "Ćwiczenia". Other sites give proper values ("Ćwiczenia", for example).
Why is it correct one time and not the other?
This is my method:
def description(self):
    description = self.soup.find_all(attrs={'name': ['description', 'Description']})
    if description:
        return description[0]['content']
It gives me the right result on one page. On another,
if description:
    return description[0]['content'].encode("windows-1252").decode("utf-8")

fixes it (I get the proper encoding), but when I open the previous site with this method I get an error:
"'charmap' codec can't encode character '\u015b' in position 69"
How can I solve this?
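Since the pages differ in how they were encoded, one option is to attempt the repair and fall back to the original string when the round trip fails. A minimal sketch, reusing the method above (the round trip only succeeds on mojibake, so correctly decoded strings pass through unchanged):

def description(self):
    description = self.soup.find_all(attrs={'name': ['description', 'Description']})
    if not description:
        return None
    content = description[0]['content']
    try:
        # UTF-8 bytes that were mis-decoded as windows-1252:
        # re-encode and decode to repair the mojibake
        return content.encode('windows-1252').decode('utf-8')
    except (UnicodeEncodeError, UnicodeDecodeError):
        # the round trip failed, so the string was already correct
        return content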
Recently I started working on a project in which I send data from an HTML page using an <a> tag. On the backend I am using the Python webapp2 framework. When I receive the data it displays perfectly, but when I compare it with a string for further use, the comparison fails.
I know the data arrives as unicode, but I converted it to UTF-8 and it is still not working.
Here is the code in the HTML. Suppose I sent "item 2" as itemname:
<a href="?itemname= item 2&itemdescription=...">Click me</a>
The code I am using to fetch the data is:

def get(self, nam, des):
    nam = self.request.get('itemname')
    itemDesc = self.request.get('itemdescription')
    name = nam.encode('utf-8')
    if name == "item 2":
        self.response.write("Equal")

I also tried it without encoding, but it still does not work. It shows the value of itemname perfectly, but the comparison never matches. Please help me find where I am making a mistake.
It looks like there are two issues: there is a space before the expected value, and the string is URL-quoted.
To fix the space:
<a href="?itemname=item 2&itemdescription=...">Click me</a>
Then, to deal with the encoding, add import urllib and change the line

name = nam.encode('utf-8')

to

name = urllib.unquote(nam.encode('utf-8'))
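A quick illustration of the second issue (Python 2, to match the webapp2 code above; the 'item%202' value is a hypothetical example of what the handler receives): a URL-quoted value never compares equal to the plain string until it is unquoted.

import urllib

raw = 'item%202'
print(raw == 'item 2')                   # False
print(urllib.unquote(raw) == 'item 2')   # True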
Recently I encountered the following problem:
I have an array of strings:

['Mueller', 'Meier', 'Schulze', 'Schmidt']

I run into problems encoding each name in Python 3 with:

name.encode('cp1252')
Here is the full snippet:

target_name = [name.encode('cp1252')
               for name in ['Mueller', 'Meier', 'Schulze', 'Schmidt']]
assert_array_equal(arr['surname'], target_name)
And here is the point where I get the error. The error states:

Fail in test..... dtype='|S7'
I've been searching for a solution for some time; what I have found so far is that I need to change the encoding. I applied:

name = np.char.encode('cp1252')

However, I get another type of error with it. Could someone help me track down the error?
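For what it's worth, a minimal sketch of how np.char.encode is normally called; note that it takes the array as its first argument, so calling it with only an encoding would itself raise an error:

import numpy as np

names = np.array(['Mueller', 'Meier', 'Schulze', 'Schmidt'])
# element-wise str.encode; the result is a byte-string array
target_name = np.char.encode(names, 'cp1252')
print(target_name.dtype)  # |S7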
I have a server app and everything works fine except saving ajax forms. If I save from a Python script (with correct input), the data comes back as unicode. But the data sent from JS is strange: on the wire there should be only bytes (that's the only data type HTTP knows), yet bottle shows me a str that is not UTF-8, and I can't encode/decode it to get the correct value. On the JS side I tried jQuery and form.serialize(); it works with other frameworks.
@post('/agt/save')
def saveagt():
    a = Agent({x: request.forms.get(x) for x in request.forms})
    print(a.nume, a.nume.encode())
    return {'ret': ags.add(a)}
... and a name like „țânțar” becomes „ÈânÈar”.
It may be a simple problem, but I think I haven't had enough coffee yet.
In case anyone is curious: bottle doesn't handle the URL correctly.
So

urllib.parse.unquote(request.body.read().decode())

solves the problem, or

d = urllib.parse.parse_qs(request.body.read().decode())
a = Agent({x: d[x][0] for x in d})

in my case.
Is it a bug in bottle? Or should I tell it to decode the URI, and I just don't know how?
Use
request.forms.getunicode('some_form_field_name')
as shorthand, if you want to get around the character conversion to latin-1.
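For context, a minimal sketch of what getunicode does differently (assuming bottle on Python 3; the route and the nume field are taken from the question above):

from bottle import post, request

@post('/agt/save')
def saveagt():
    raw = request.forms.get('nume')          # native string, decoded as latin-1 by the WSGI layer
    nume = request.forms.getunicode('nume')  # re-decoded using the request charset (UTF-8 by default)
    # getunicode is roughly equivalent to: raw.encode('latin-1').decode('utf-8')
    return {'nume': nume}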
I am trying to migrate a forum to phpBB3 with Python/XPath. Although I am pretty new to Python and XPath, it is going well. However, I need help with an error.
(The source file has been downloaded and processed with tagsoup.)
Firefox/Firebug show the XPath: /html/body/table[5]/tbody/tr[position()>1]/td/a[3]/b
(in my script without tbody)
Here is an abbreviated version of my code:
from lxml import etree

forumfile = "morethread-alte-korken-fruchtweinkeller-89069-6046822-0.html"

XPOSTS = "/html/body/table[5]/tr[position()>1]"
t = etree.parse(forumfile)
allposts = t.xpath(XPOSTS)

XUSER = "td[1]/a[3]/b"
XREG = "td/span"
XTIME = "td[2]/table/tr/td[1]/span"
XTEXT = "td[2]/p"
XSIG = "td[2]/i"
XAVAT = "td/img[last()]"
XPOSTITEL = "/html/body/table[3]/tr/td/table/tr/td/div/h3"
XSUBF = "/html/body/table[3]/tr/td/table/tr/td/div/strong[position()=1]"

for p in allposts:
    unreg = 0
    username = None
    username = p.find(XUSER).text  # this is where it goes haywire
When the loop hits user "tompson" at position()=11, near the end of the file, I get

AttributeError: 'NoneType' object has no attribute 'text'

I've tried plenty of try/except/else/finally blocks, but they weren't helpful.
I extract much more information later in the script, such as the date of the post, the date of user registration, the URL and attributes of the avatar, and the content of the post.
The script works for hundreds of other files/pages of this forum.
This is not an encode/decode problem, and it is not limited to the XUSER part. I tried hardcoding the username; then the date of registration failed. If I skip those, the text of the post (code below) fails...
# text of getpost
text = etree.tostring(p.find(XTEXT), pretty_print=True)
Now, this whole error would make sense if my XPath were wrong. However, all the other files, and the first ten users in this file, work; it fails only at position()=11.
Is position() incapable of going above 10? I don't think so.
Am I missing something?
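One way to keep the loop running while inspecting the odd row (a sketch reusing the names from the snippet above):

for p in allposts:
    user_node = p.find(XUSER)
    if user_node is None:
        # malformed row: dump it for inspection instead of crashing
        print(etree.tostring(p, pretty_print=True))
        username = None
    else:
        username = user_node.text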
Question answered!
I have found the answer...
I must have been very tired when I tried to fix it and came here to ask for help; I did not see something quite obvious. The way I posted my problem, it was not visible either.
The HTML I downloaded and processed with tagsoup had an additional tag at position 11. It was not visible on the website, and it broke my XPath. (It is probably crappy HTML generated by the forum, combined with tagsoup's attempt to make it parseable.)
Out of more than 20000 files, fewer than 20 are afflicted; this one just happened to be the first.
Additionally, sometimes the information is in table[4] and other times in table[5]. I did account for this and wrote a function to determine the correct table. Although I tested the function a lot and thought it was working correctly (hence I did not include it above), it was not.
So I made a better XPath:

'/html/body/table[tr/td[@width="20%"]]/tr[position()>1]'
And, although this is unrelated, I ran into another problem with unexpected encoding in the HTML files (not UTF-8), which was fixed by adding:

parser = etree.XMLParser(encoding='ISO-8859-15')
t = etree.parse(forumfile, parser)
I am now confident that, after adjusting for the strange additional and duplicated tags, my code will work on all files.
Still, I will be looking into lxml.html. As I mentioned in the comments, I have never used it before, but if it is more robust and can handle the files without tagsoup, it might be a better fit and save me extensive try/except statements and loops to fix the few files breaking my current script...
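For reference, a minimal sketch of the lxml.html route (untested against these files; the point is that its parser tolerates messy real-world HTML, which might make the tagsoup pass unnecessary):

from lxml import html

t = html.parse(forumfile)  # forgiving HTML parser; no tagsoup preprocessing
allposts = t.xpath(XPOSTS)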
I am seeing some strange behavior while using urllib2 to open a URL and download a video.
I am trying to open a video resource; here is an example link:
https://zencoder-temp-storage-us-east-1.s3.amazonaws.com/o/20130723/b3ed92cc582885e27cb5c8d8b51b9956/b740dc57c2a44ea2dc2d940d93d772e2.mp4?AWSAccessKeyId=AKIAI456JQ76GBU7FECA&Signature=S3lvi9n9kHbarCw%2FUKOknfpkkkY%3D&Expires=1374639361
I have the following code:

mp4_url = ''
# response_body is a JSON response that I get the mp4_url from
if response_body['outputs'][0]['label'] == 'mp4':
    mp4_url = response_body['outputs'][0]['url']
if mp4_url:
    logging.info('this is the mp4_url')
    logging.info(mp4_url)
    # if I add the line directly below this then it works just fine
    mp4_url = 'https://zencoder-temp-storage-us-east-1.s3.amazonaws.com/o/20130723/b3ed92cc582885e27cb5c8d8b51b9956/b740dc57c2a44ea2dc2d940d93d772e2.mp4?AWSAccessKeyId=AKIAI456JQ76GBU7FECA&Signature=S3lvi9n9kHbarCw%2FUKOknfpkkkY%3D&Expires=1374639361'
    mp4_video = urllib2.urlopen(mp4_url)
    logging.info('successfully opened the url')
The code works when I add the designated line, but without it I get an HTTP Error 403: Forbidden message, which makes me think something is mangling mp4_url. The confusing part is that the logging line shows mp4_url as exactly what I hardcoded. What could the difference be? Are there characters in it that may be disrupting the request? I have tried converting it to a string with:

mp4_video = urllib2.urlopen(str(mp4_url))

But that didn't do anything. Any ideas?
UPDATE:
With the suggestion to use print repr(mp4_url), it gives me:

u'https://zencoder-temp-storage-us-east-1.s3.amazonaws.com/o/20130723/b3ed92cc582885e27cb5c8d8b51b9956/b740dc57c2a44ea2dc2d940d93d772e2.mp4?AWSAccessKeyId=AKIAI456JQ76GBU7FECA&Signature=S3lvi9n9kHbarCw%2FUKOknfpkkkY%3D&Expires=1374639361'

And I suppose the difference (it is a unicode string, not a byte string) is what is causing the error, but what would be the best way to handle this?
UPDATE II:
It turned out that I did need to cast it to a string, but the source I was getting the link from (a freshly encoded video) also needed roughly a 60-second delay before it could serve that URL. That is why it kept working when I hardcoded the URL: by the time I retried, the delay had passed. Thanks for the help!
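A sketch of the retry approach this implies (the function name, attempt count, and wait time are illustrative, not from the original code):

import time
import urllib2

def open_when_ready(url, attempts=6, wait=15):
    # The freshly encoded video is only served after a delay, so retry
    # on HTTP 403 for a while instead of failing on the first attempt.
    for _ in range(attempts):
        try:
            return urllib2.urlopen(str(url))  # str() also converts the unicode URL
        except urllib2.HTTPError as e:
            if e.code != 403:
                raise
            time.sleep(wait)
    raise RuntimeError('URL never became available: %r' % url)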
It would be better to simply dump the response you obtained; that way you could check what response_body['outputs'][0]['label'] actually evaluates to. Also note that you initialize mp4_url to ''. This is not the same as None, but both are falsy, so if the label check never matches, the if mp4_url: block is silently skipped. You may want to verify that the initial if statement comparing response_body['outputs'][0]['label'] is correct.