I'm writing a Python script to extract the URL of a Facebook video. But in the source of the video page, I see escape sequences of the form \uxxxx in the URL.
For instance, the URL appears in this form:
https\u00253A\u00255C\u00252F\u00255C\u00252Ffbcdn-video-a.akamaihd.net\u00255C\u00252Fhvideo-ak-prn2\u00255C\u00252Fv\u00255C\u00252F753002_318048581647953_53890_n.mp4\u00253Foh\u00253D64e3e8ecf7e88f1da335d88949b2dc1f\u002526oe\u00253D52226D10\u002526__gda__\u00253D1377987338_9e37fb163a1d37d4b06ab7cff668f7dc\u002522\u00252C\u002522
\u00253A is a colon (:), but how do I convert it?
When I tried:
>>> x.decode('unicode_escape').encode('ascii','ignore')
I get:
'https%3A%5C%2F%5C%2Ffbcdn-video-a.akamaihd.net%5C%2Fhvideo-ak-prn2%5C%2Fv%5C%2F753002_318048581647953_53890_n.mp4%3Foh%3D64e3e8ecf7e88f1da335d88949b2dc1f%26oe%3D52226D10%26__gda__%3D1377987338_9e37fb163a1d37d4b06ab7cff668f7dc%22%2C%22
I want the exact URL, not the percent-encoded version.
I searched a lot but couldn't find any help. Thanks in advance.
Edit
Is there any way to pass in the whole source of the Facebook page and convert all such complex escape sequences to plain characters in one go?
>>> import urllib
>>> s = b'https\u00253A\u00255C\u00252F\u00255C\u00252Ffbcdn-video'
>>> print urllib.unquote_plus(s.decode('unicode_escape'))
https:\/\/fbcdn-video
It seems that your string is additionally backslash-escaped (\/ instead of /), so strip the backslashes as well:
>>> import re
>>> import urllib
>>> s = b'https\u00253A\u00255C\u00252F\u00255C\u00252Ffbcdn-video'
>>> re.sub(r'\\(.)', r'\1', urllib.unquote_plus(s.decode('unicode_escape')))
u'https://fbcdn-video'
Related
I'm writing a Python script to fetch Korean vocabulary pronunciation. I have a URL ready to go, and when I open the URL in Safari, it retrieves the expected JSON from the server.
When I use requests to get the JSON, the call fails and no results are found.
Using Charles, I can see that the URL with my original query, a Hangul word, is URL encoded after I paste the URL into Safari and hit enter. For example, the instance of 소식 in the URL string becomes %EC%86%8C%EC%8B%9D on its way out.
However, when I make that same request with requests, the word is encoded as %E1%84%89%E1%85%A9%E1%84%89%E1%85%B5%E1%86%A8. Both encodings can be decoded back to the original word 소식 (using a web app to confirm). The former encoding is accepted by the server, the latter is not.
Why would I be getting a different encoding from requests?
Edit
The query string comes into the script as 소식:
query = sys.argv[1]
sys.stderr.write(query) -> 소식
Interpolating the query into the URL string yields ...json/word/소식... when printing it.
Going through Charles, it now looks like this: /json/word/%E1%84%89%E1%85%A9%E1%84%89%E1%85%B5%E1%86%A8/. Everything is default; no encoding is specified anywhere.
These are both valid URL-encodings of the "same" input text:
>>> from urllib.parse import unquote
>>> ulong = unquote('%E1%84%89%E1%85%A9%E1%84%89%E1%85%B5%E1%86%A8')
>>> ushort = unquote('%EC%86%8C%EC%8B%9D')
>>> ulong
'소식'
>>> ushort
'소식'
The strings are not actually equal, though; they have different forms in Unicode:
>>> from unicodedata import name
>>> [name(x) for x in ulong]
['HANGUL CHOSEONG SIOS',
'HANGUL JUNGSEONG O',
'HANGUL CHOSEONG SIOS',
'HANGUL JUNGSEONG I',
'HANGUL JONGSEONG KIYEOK']
>>> [name(x) for x in ushort]
['HANGUL SYLLABLE SO', 'HANGUL SYLLABLE SIG']
I do not know any Korean, but it looks like the long string is composed of combining characters (you can see similar things with Latin characters and accents). If I perform a canonical decomposition and composition of the forms, I get equality:
>>> from unicodedata import normalize
>>> normalize('NFC', ulong) == ushort
True
So either you are using different input texts that just happen to look the same (even repr is not enough to see the difference; you have to examine the codepoints), or one of the methods you are using, probably the browser, is performing a normalization/transformation.
Since the short form of the text is what worked with the server, I suggest you normalize the inputs to your script into the NFC form.
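For example, a minimal sketch of that fix (Python 3; the example.com endpoint is a stand-in for whatever server you are actually querying):

import sys
from unicodedata import normalize
from urllib.parse import quote

# Compose any decomposed jamo into precomposed syllables before quoting,
# so the query goes out as %EC%86%8C%EC%8B%9D rather than the long form.
query = normalize('NFC', sys.argv[1])
url = 'https://example.com/json/word/{}/'.format(quote(query))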
I am using the Readability Parser API to extract content from a web page. It works fine when the page is in a Latin character set, but when I extract an article in Cyrillic, it ends up with the following:
<div>&#x412;&#x432;&#x43E;&#x441;&#x43A;&#x440;&#x435;&#x441;&#x435;&#x43D;&#x44C;</div>...etc
The interesting thing here is that the title of the web page is extracted correctly in Cyrillic, but not the content. My attempt was to do the following, as suggested in this SO answer:
content = unicodedata.normalize('NFKD', content).encode('ascii','ignore')
but it did not work. Could you tell me if there is a way to convert this string before saving it to the database?
Please let me know if the title of my question correctly describes what I need. Thank you.
One way (Python 3.3):
>>> s='<div>&#x412;&#x432;&#x43E;&#x441;&#x43A;&#x440;&#x435;&#x441;&#x435;&#x43D;&#x44C;</div>'
>>> import html.parser
>>> h=html.parser.HTMLParser()
>>> h.unescape(s)
'<div>Ввоскресень</div>'
Python 2.7:
>>> s='<div>&#x412;&#x432;&#x43E;&#x441;&#x43A;&#x440;&#x435;&#x441;&#x435;&#x43D;&#x44C;</div>'
>>> import HTMLParser
>>> h=HTMLParser.HTMLParser()
>>> print(h.unescape(s))
<div>Ввоскресень</div>
P.S. I went looking for the documentation link, and it turns out unescape isn't documented. Here's a way that avoids the undocumented API (back on Python 3):
>>> import re
>>> re.sub(r'&#x(.*?);', lambda m: chr(int(m.group(1), 16)), s)
'<div>Ввоскресень</div>'
Per the comments, it is finally documented (and moved) in Python 3.4:
https://docs.python.org/3.4/library/html.html#html.unescape
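With that, on Python 3.4 or later the one-liner is simply:

>>> import html
>>> html.unescape('<div>&#x412;&#x432;&#x43E;&#x441;&#x43A;&#x440;&#x435;&#x441;&#x435;&#x43D;&#x44C;</div>')
'<div>Ввоскресень</div>'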
Example from Google:
http://www.google.com.co/url?sa=t&rct=j&q=pedro%20gomez%20proyecto%20en%20la%20ciudad%20de%20valledupar&source=web&cd=10&ved=0CFsQFjAJ&url=http%3A%2F%2Fwww.21molino.com%2F1410%2F8911.html
or from Bing search:
http://www.bing.com/search?q=10%2F30+Sand&src=IE-SearchBox&FORM=IE8SRC
I want to parse out and match the ?q= or &q= parameter, using a lookbehind such as (?<=...) with the Python re module. Also, how can I decode the percent-encoded ASCII in these URLs back to UTF-8 so that it can be read?
Need some help here, thanks very much :)
Try this:
[?&]q=([^&#]*)
Or, better yet:
import urlparse
pr = urlparse.urlparse(url)
qs = urlparse.parse_qs(pr.query)['q']
The latter automatically decodes %-escapes, too.
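For example, against the Google URL above (Python 2, matching the urlparse import):

>>> import urlparse
>>> url = ('http://www.google.com.co/url?sa=t&rct=j'
...        '&q=pedro%20gomez%20proyecto%20en%20la%20ciudad%20de%20valledupar'
...        '&source=web&cd=10&ved=0CFsQFjAJ'
...        '&url=http%3A%2F%2Fwww.21molino.com%2F1410%2F8911.html')
>>> urlparse.parse_qs(urlparse.urlparse(url).query)['q']
['pedro gomez proyecto en la ciudad de valledupar']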
I used BeautifulSoup to handle XML files that I have collected through a REST API.
The responses contain HTML code, but BeautifulSoup escapes all the HTML tags so it can be displayed nicely.
Unfortunately I need the HTML code.
How would I go about transforming the escaped HTML into proper markup?
Help would be very much appreciated!
I think you want xml.sax.saxutils.unescape from the Python standard library.
E.g.:
>>> from xml.sax import saxutils as su
>>> s = '&lt;foo&gt;bar&lt;/foo&gt;'
>>> su.unescape(s)
'<foo>bar</foo>'
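Note that unescape() only handles &amp;, &lt; and &gt; by default; anything else (e.g. &quot;) has to be supplied through its entities mapping:

>>> su.unescape('&quot;bar&quot;', {'&quot;': '"'})
'"bar"'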
You could try the urllib module?
It has a method unquote() that might suit your needs.
Edit: on second thought (and on more reading of your question), you might just want to use string.replace().
Like so:
string.replace('&lt;', '<')
string.replace('&gt;', '>')
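Two caveats if you go the replace() route: it returns a new string rather than modifying in place, so you must reassign the result, and &amp; should be unescaped last to avoid double-unescaping. A minimal sketch (unescape_basic is my own helper name):

def unescape_basic(text):
    # '&amp;' goes last so that '&amp;lt;' becomes '&lt;', not '<'
    for entity, char in (('&lt;', '<'), ('&gt;', '>'), ('&amp;', '&')):
        text = text.replace(entity, char)
    return text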
I want to convert a Python string to URL (percent-encoded) syntax.
For example:
>>> u'한글'.encode('utf-8')
'\xed\x95\x9c\xea\xb8\x80'
and I want that as '%ed%95%9c%ea%b8%80'.
>>> import urllib2
>>> urllib2.quote('한글')
'%ED%95%9C%EA%B8%80'
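On Python 3 the same function lives in urllib.parse, and it round-trips:

>>> from urllib.parse import quote, unquote
>>> quote('한글')
'%ED%95%9C%EA%B8%80'
>>> unquote('%ED%95%9C%EA%B8%80')
'한글'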