I am using the Readability Parser API to extract content from a web page. It works fine when the page uses a Latin character set, but when I extract an article in Cyrillic, it ends up like the following:
<div>&#x412;&#x432;&#x43E;&#x441;&#x43A;&#x440;&#x435;&#x441;&#x435;&#x43D;&#x44C;</div>...etc
The interesting thing here is that the title of the web page is extracted correctly in Cyrillic, but the content is not. My attempt was to do the following, as suggested in this SO answer:
content = unicodedata.normalize('NFKD', content).encode('ascii','ignore')
but it did not work. Could you tell me if there is a way to convert this string before saving it to the database?
Please let me know if the title of my question correctly describes what I need. Thank you.
One way (Python 3.3):
>>> s='<div>&#x412;&#x432;&#x43E;&#x441;&#x43A;&#x440;&#x435;&#x441;&#x435;&#x43D;&#x44C;</div>'
>>> import html.parser
>>> h=html.parser.HTMLParser()
>>> h.unescape(s)
'<div>Ввоскресень</div>'
Python 2.7:
>>> s='<div>&#x412;&#x432;&#x43E;&#x441;&#x43A;&#x440;&#x435;&#x441;&#x435;&#x43D;&#x44C;</div>'
>>> import HTMLParser
>>> h=HTMLParser.HTMLParser()
>>> print(h.unescape(s))
<div>Ввоскресень</div>
P.S. I went to look for the documentation link and it looks like unescape isn't documented. Here's a way without using an undocumented API:
>>> import re
>>> re.sub(r'&#x(.*?);', lambda x: chr(int(x.group(1), 16)), s)
'<div>Ввоскресень</div>'
Per the comments, it looks like it is finally documented (and moved) in Python 3.4:
https://docs.python.org/3.4/library/html.html#html.unescape
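On Python 3.4 or later you can call that documented helper directly; a minimal sketch with the string from above:
>>> import html
>>> html.unescape('<div>&#x412;&#x432;&#x43E;&#x441;&#x43A;&#x440;&#x435;&#x441;&#x435;&#x43D;&#x44C;</div>')
'<div>Ввоскресень</div>'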
I'm trying to send Bengali text using an SMS gateway. However, it doesn't normally support Bengali text. Their documentation says I need to convert the SMS string to utf-16be, without any other details. I did find a Python implementation of what I'm looking for here.
>>> message = 'আমার সোনার বাংলা'
>>> message
'আমার সোনার বাংলা'
>>> message.encode('utf-16-be')
b'\t\x86\t\xae\t\xbe\t\xb0\x00 \t\xb8\t\xcb\t\xa8\t\xbe\t\xb0\x00 \t\xac\t\xbe\t\x82\t\xb2\t\xbe'
>>> message.encode('utf-16-be').hex()
'098609ae09be09b0002009b809cb09a809be09b0002009ac09be098209b209be'
>>> message.encode('utf-16-be').hex().upper()
'098609AE09BE09B0002009B809CB09A809BE09B0002009AC09BE098209B209BE'
I am trying to accomplish two things here:
Understand the Python Implementation
Replicate the same procedure in Ruby 2.6
So far I've come up with the following:
text = 'আমার সোনার বাংলা'.encode("UTF-16BE")
p text
#output-> "\u0986\u09AE\u09BE\u09B0 \u09B8\u09CB\u09A8\u09BE\u09B0 \u09AC\u09BE\u0982\u09B2\u09BE"
In Ruby, getting the hex representation of a string's bytes is typically done with the unpack method:
# see the unpack documentation for format specifics; 'H*' extracts the whole string as hex digits
text.encode('UTF-16BE').unpack('H*')  # returns a one-element array; take .first and .upcase to match the Python output
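For reference, the Python steps from the question can also be wrapped into one small helper (the function name here is mine, purely for illustration), which is handy for checking that the Ruby port produces the same payload:
>>> def to_utf16be_hex(message):
...     # UTF-16 big-endian bytes, rendered as uppercase hex
...     return message.encode('utf-16-be').hex().upper()
...
>>> to_utf16be_hex('আমার সোনার বাংলা')
'098609AE09BE09B0002009B809CB09A809BE09B0002009AC09BE098209B209BE'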
I have a string in Python that has some HTML in it. Basically, it looks like this:
>>> print someString # I get someString from the backend
"<img style='height:50px;' src='somepath'/>"
I am trying to display this HTML in a PDF. Because my PDF generator can't handle the style attribute (and no, I can't use a different generator), I have to remove it from the string. So basically, it should end up like this:
>>> print someString # I get someString from the backend
"<img style='height:50px;' src='somepath'/>"
>>> parsedString = someFunction(someString)
>>> print parsedString
"<img src='somepath'/>"
I guess the best way to do this is with RegEx, but I'm not very keen on it. Can someone help me out?
I wouldn't use a regex for this, because:
Regex is not really suited to HTML parsing; even though this is a simple case, there are many variations and edge cases to consider, and the resulting regex could turn into a nightmare.
Regexes are painful. They can be really useful, but honestly, they are the epitome of user-unfriendliness.
Alright, so how would I go about it? I would use trusty BeautifulSoup! Install it with pip using the following command:
pip install beautifulsoup4
Then you can do the following to remove the style:
from bs4 import BeautifulSoup as Soup
soup = Soup(someString)
del soup.find('img')['style']
This first parses your string, then finds the img tag and then deletes its style attribute.
It should also work with arbitrary strings but I can't promise that. Maybe you will come up with an edge case.
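To get the cleaned markup back out, just turn the soup back into a string; with the example from the question it looks roughly like this (a sketch; the exact quoting in the output can vary by parser):
>>> soup = Soup("<img style='height:50px;' src='somepath'/>", 'html.parser')
>>> del soup.find('img')['style']
>>> str(soup)
'<img src="somepath"/>'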
Remember, using a regex to parse an HTML string is not the best of ideas. The internet and Stack Overflow are full of answers on why this is not really possible.
Edit: Just for kicks you might want to check out this answer. You know stuff is serious when it is said that even Jon Skeet can't do it.
Using a regex to work with HTML is a very bad idea, but if you really want to use it, try this:
/style=["']?((?:.(?!["']?\s+(?:\S+)=|[>"']))+.)["']?/ig
I'm writing a Python script which will extract the URL of a Facebook video. But in the source of the video page, I see some characters of the form \uxxxxxx in the URL.
For instance, the URL is in this form:
https\u00253A\u00255C\u00252F\u00255C\u00252Ffbcdn-video-a.akamaihd.net\u00255C\u00252Fhvideo-ak-prn2\u00255C\u00252Fv\u00255C\u00252F753002_318048581647953_53890_n.mp4\u00253Foh\u00253D64e3e8ecf7e88f1da335d88949b2dc1f\u002526oe\u00253D52226D10\u002526__gda__\u00253D1377987338_9e37fb163a1d37d4b06ab7cff668f7dc\u002522\u00252C\u002522
\u00253A is a colon (:), but how do I convert it?
When I did the following:
>>> x.decode('unicode_escape').encode('ascii','ignore')
I get
'https%3A%5C%2F%5C%2Ffbcdn-video-a.akamaihd.net%5C%2Fhvideo-ak-prn2%5C%2Fv%5C%2F753002_318048581647953_53890_n.mp4%3Foh%3D64e3e8ecf7e88f1da335d88949b2dc1f%26oe%3D52226D10%26__gda__%3D1377987338_9e37fb163a1d37d4b06ab7cff668f7dc%22%2C%22
I want the exact URL, not the percent-encoded one.
I searched a lot but couldn't find any help.
Thanks in advance
Edit
Is there any way to pass the whole source of the Facebook page and convert all such escaped Unicode characters to plain ones?
>>> import urllib
>>> s = b'https\u00253A\u00255C\u00252F\u00255C\u00252Ffbcdn-video'
>>> print urllib.unquote_plus(s.decode('unicode_escape'))
https:\/\/fbcdn-video
It seems that your string is also backslash-escaped.
>>> import re
>>> import urllib
>>> s = b'https\u00253A\u00255C\u00252F\u00255C\u00252Ffbcdn-video'
>>> re.sub(r'\\(.)', r'\1', urllib.unquote_plus(s.decode('unicode_escape')))
u'https://fbcdn-video'
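For the Python 3 equivalent (a sketch; it assumes the text still contains the literal \u0025... sequences, and unquote_plus now lives in urllib.parse):
>>> import re
>>> from urllib.parse import unquote_plus
>>> raw = r'https\u00253A\u00255C\u00252F\u00255C\u00252Ffbcdn-video'
>>> decoded = raw.encode('ascii').decode('unicode_escape')  # turns \u0025 into %
>>> re.sub(r'\\(.)', r'\1', unquote_plus(decoded))
'https://fbcdn-video'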
I am using BeautifulSoup to handle XML files that I have collected through a REST API.
The responses contain HTML code, but BeautifulSoup escapes all the HTML tags so it can be displayed nicely.
Unfortunately I need the HTML code.
How would I go on about transforming the escaped HTML into proper markup?
Help would be very much appreciated!
I think you want xml.sax.saxutils.unescape from the Python standard library.
E.g.:
>>> from xml.sax import saxutils as su
>>> s = '&lt;foo&gt;bar&lt;/foo&gt;'
>>> su.unescape(s)
'<foo>bar</foo>'
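Note that by default saxutils.unescape only handles &amp;, &lt; and &gt;; any other entities you care about can be passed in through its entities mapping, for example:
>>> su.unescape('&lt;foo&gt; &amp; &quot;bar&quot;', entities={'&quot;': '"'})
'<foo> & "bar"'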
You could try the urllib module?
It has a method unquote() that might suit your needs.
Edit: on second thought (and after more reading of your question), you might just want to use string.replace()
Like so:
string = string.replace('&lt;', '<')
string = string.replace('&gt;', '>')
I want to scrape some information off a football (soccer) web page using simple Python regexps. The problem is that players such as the first chap, ÄÄRITALO, come out as &#196;&#196;RITALO!
That is, the HTML uses escaped markup for the special characters, such as &#196;
Is there a simple way of reading the HTML into the correct Python string? If it were XML/XHTML it would be easy; the parser would do it.
I would recommend BeautifulSoup for HTML scraping. You also need to tell it to convert HTML entities to the corresponding Unicode characters, like so:
>>> from BeautifulSoup import BeautifulSoup
>>> html = "<html>ÄÄRITALO!</html>"
>>> soup = BeautifulSoup(html, convertEntities=BeautifulSoup.HTML_ENTITIES)
>>> print soup.contents[0].string
ÄÄRITALO!
(It would be nice if the standard codecs module included a codec for this, such that you could do "some_string".decode('html_entities') but unfortunately it doesn't!)
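For what it's worth, with the newer bs4 package (BeautifulSoup 4), character references are converted automatically; a minimal sketch:
>>> from bs4 import BeautifulSoup
>>> soup = BeautifulSoup("<html>&#196;&#196;RITALO!</html>", "html.parser")
>>> print(soup.get_text())
ÄÄRITALO!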
EDIT:
Another solution:
Python developer Fredrik Lundh (author of elementtree, among other things) has a function to unescape HTML entities on his website, which works with decimal, hex and named entities (BeautifulSoup will not work with the hex ones).
Try using BeautifulSoup. It should do the trick and give you a nicely formatted DOM to work with as well.
This blog entry seems to have had some success with it.
I haven't tried it myself, but have you tried
http://zesty.ca/python/scrape.html ?
It seems to have a method htmldecode(text) which would do what you want.