Python unescape URL

I have got a URL in this form: http:\\/\\/en.wikipedia.org\\/wiki\\/The_Truman_Show. How can I turn it back into a normal URL? I have tried using urllib.unquote, without much success.
I can always use regular expressions or some simple string replace stuff. But I believe that there is a better way to handle this...

urllib.unquote is for replacing %xx escape codes in URLs with the characters they represent. It won't be useful for this.
Your "simple string replace stuff" is probably the best solution.

Have you tried using json.loads from the json module?
>>> json.loads('"http:\\/\\/en.wikipedia.org\\/wiki\\/The_Truman_Show"')
'http://en.wikipedia.org/wiki/The_Truman_Show'
The input that I'm showing isn't exactly what you have; I've wrapped it in double quotes to make it valid JSON.
When you first get it from the json, how are you decoding it? That's probably where the problem is.
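To illustrate that point: if the whole payload is decoded as JSON in the first place, the \/ sequences disappear on their own. A minimal sketch, with a made-up payload shape (your actual response will look different):

```python
import json

# Hypothetical payload shape -- in JSON, "\/" is simply an escaped "/"
payload = '{"url": "http:\\/\\/en.wikipedia.org\\/wiki\\/The_Truman_Show"}'
data = json.loads(payload)
print(data["url"])  # http://en.wikipedia.org/wiki/The_Truman_Show
```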

It is a bit childish to look for a library function when you can transform the URL yourself. Since the only visible rule is that "/" has been replaced by "\/", you can simply replace it back:
def unescape_this(url):
    # "\\/" in source is the two characters \/ -- replace them with a plain /
    return url.replace("\\/", "/")

Related

Python encode spaces in url only and not other special characters

I know this question has been asked many times but I can't seem to find the variation that I'm looking for specifically.
I have a URL, let's say it's:
https://somethingA/somethingB/somethingC/some spaces here
I want to convert it to:
https://somethingA/somethingB/somethingC/some%20spaces%20here
I know I can do it with replace, like below:
url = 'https://somethingA/somethingB/somethingC/some spaces here'
url = url.replace(' ', '%20')
But I have a feeling that the best practice is probably to use the urllib.parse library. The problem is that when I use it, it encodes other special characters like : too.
So if I do:
url = 'https://somethingA/somethingB/somethingC/some spaces here'
urllib.parse.quote(url)
I get:
https%3A//somethingA/somethingB/somethingC/some%20spaces%20here
Notice the : also gets converted to %3A. So my question is: is there a way I can achieve the same thing as replace with urllib? I would rather use a tried and tested library that is designed specifically to encode URLs than reinvent the wheel and possibly miss something that leads to a security loophole. Thank you.
quote() is built to work on just the path portion of a URL, so you need to break things up a bit, like this:
from urllib.parse import urlparse, quote

url = 'https://somethingA/somethingB/somethingC/some spaces here'
parts = urlparse(url)
# parts.path already starts with '/', so don't add another one
fixed_url = f"{parts.scheme}://{parts.netloc}{quote(parts.path)}"
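A variation on the same idea that also preserves any query string or fragment: split the URL, quote() only the path (the / is in quote()'s default safe set), and reassemble. The URL below is just the question's example:

```python
from urllib.parse import quote, urlsplit, urlunsplit

url = 'https://somethingA/somethingB/somethingC/some spaces here'
parts = urlsplit(url)
# Quote only the path; scheme, host, query and fragment pass through untouched
fixed = urlunsplit((parts.scheme, parts.netloc, quote(parts.path),
                    parts.query, parts.fragment))
print(fixed)  # https://somethingA/somethingB/somethingC/some%20spaces%20here
```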

Python - Remove 'style'-attribute from HTML

I have a string in Python which has some HTML in it. Basically it looks like this.
>>> print someString # I get someString from the backend
"<img style='height:50px;' src='somepath'/>"
I try to display this HTML in a PDF. Because my PDF generator can't handle the style attribute (and no, I can't use a different generator), I have to remove it from the string. So basically, it should end up like this:
>>> print someString # I get someString from the backend
"<img style='height:50px;' src='somepath'/>"
>>> parsedString = someFunction(someString)
>>> print parsedString
"<img src='somepath'/>"
I guess the best way to do this is with RegEx, but I'm not very keen on it. Can someone help me out?
I wouldn't use a regex for this, because:
Regex is not really suited to HTML parsing; even though this is a simple case, there are many variations and edge cases to consider, and the resulting regex could turn into a nightmare.
Regex is also, however useful, the epitome of user-unfriendliness.
So how would I go about it? I would use trusty BeautifulSoup! Install it with pip using the following command:
pip install beautifulsoup4
Then you can do the following to remove the style:
from bs4 import BeautifulSoup as Soup

soup = Soup(someString, 'html.parser')
del soup.find('img')['style']
parsedString = str(soup)
This first parses your string, then finds the img tag and deletes its style attribute; str(soup) serializes the result back to a string. (The one-liner del Soup(someString).find('img')['style'] would delete the attribute but immediately discard the parsed document, so you would never see the result.)
It should also work with arbitrary strings but I can't promise that. Maybe you will come up with an edge case.
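If the string can contain more than one styled tag, the same idea generalizes; here is a small sketch (the strip_styles name is mine, not from the question) that removes the style attribute from every tag that has one:

```python
from bs4 import BeautifulSoup

def strip_styles(html):
    # Delete the style attribute from every tag that carries one
    soup = BeautifulSoup(html, 'html.parser')
    for tag in soup.find_all(style=True):
        del tag['style']
    return str(soup)

print(strip_styles("<img style='height:50px;' src='somepath'/>"))
```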
Remember, using a regex to parse HTML is not the best of ideas. The internet and Stack Overflow are full of answers explaining why.
Edit: Just for kicks you might want to check out this answer. You know stuff is serious when it is said that even Jon Skeet can't do it.
Using a regex to work with HTML is a very bad idea, but if you really want to, try this:
/style=["']?((?:.(?!["']?\s+(?:\S+)=|[>"']))+.)["']?/ig

adding regexp to yaml python

Is there any way to store and read this regexp in YAML using Python:
regular: /<title [^>]*lang=("|')wo("|')>/
Does anyone have an idea or a solution for this?
I have the following error:
% ch.encode('utf-8'), self.get_mark())
yaml.scanner.ScannerError: while scanning for the next token
found character '|' that cannot start any token
in "test.yaml", line 10, column 49
My code:
def test2():
    clueAppconf = open('test.yaml')
    clueContext = yaml.load(clueAppconf)
    print clueContext['webApp']
OK, it looks like the problem is the type of scalar you have chosen to represent this regex. A plain (unquoted) scalar can't contain these characters, so you need a double-quoted scalar, with the embedded double quotes escaped:
regular: "/<title [^>]*lang=(\"|')wo(\"|')>/"
Inside a double-quoted scalar the | itself needs no escaping, but if the parser chokes on other characters you can also write them as escape codes, e.g. \x7C or \u007C for the pipe:
regular: "/<title [^>]*lang=(\"\u007C')wo(\"\u007C')>/"
I'm a little rusty on my YAML, so there may be a way to keep all the special characters fully readable that I've missed; based on a cursory scan of the YAML documentation, this was the best I could find.
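As a sanity check (assuming PyYAML), a double-quoted scalar with escaped inner quotes loads and compiles cleanly; the surrounding /.../ delimiters are a JavaScript-style convention, so they are stripped before compiling:

```python
import re
import yaml  # PyYAML

doc = r'''
regular: "/<title [^>]*lang=(\"|')wo(\"|')>/"
'''
config = yaml.safe_load(doc)
pattern = config['regular'].strip('/')  # drop the /.../ delimiters
regex = re.compile(pattern)
print(bool(regex.search('<title class="x" lang="wo">')))  # True
```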

[Python] How to deal with a string ending with one backslash?

I'm getting some content from Twitter API, and I have a little problem, indeed I sometimes get a tweet ending with only one backslash.
More precisely, I'm using simplejson to parse Twitter stream.
How can I escape this backslash?
From what I have read, such raw string shouldn't exist ...
Even if I add one backslash (two, in fact) I still get an error, as I suspected (since I then have an odd number of backslashes).
Any idea ?
I can just forget about these tweets too, but I'm still curious about that.
Thanks : )
Prefixing the string with r (stands for "raw") stops backslashes inside the literal from being interpreted as escape sequences. For example:
print r'\b\n\\'
will output
\b\n\\
Have I understood the question correctly?
I guess you are looking for a method similar to stripslashes in PHP. So, here you go:
Python version of PHP's stripslashes
You can try using a raw string (prepend an r to the literal, so nothing has to be escaped) or re.escape().
I'm not really sure what you need considering I haven't seen the text of the response. If none of the methods you come up with on your own or get from here work, you may have to forget about those tweets.
Unless you update your question and come back with a real problem, I'm asserting that you don't have an issue except confusion.
You get the string from the Twitter API; ergo, the string does not show up in your code. "Raw strings" exist only in your code, and it is string literals in code that can't end in a single backslash.
Consider this:
def some_obscure_api():
    "This exists in a library, so you don't know what it does"
    return r"hello" + "\\"  # addition just for fun
my_string = some_obscure_api()
print(my_string)
See? my_string happily ends in a backslash and your code couldn't care less.
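You can check the same point directly: a string object may end with a single backslash even though no single string literal in source code can. A quick sketch:

```python
# Build a string ending in one backslash without writing an impossible
# literal -- chr(92) is the backslash character
s = "trailing" + chr(92)

print(len(s))             # 9: eight letters plus one backslash
print(s == "trailing\\")  # True: "\\" in source is ONE backslash in memory
```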

Cleaning an XML file in Python before parsing

I'm using minidom to parse an XML file and it threw an error indicating that the data is not well formed. I figured out that some of the pages have characters like ไอเฟล &, causing the parser to hiccup. Is there an easy way to clean the file before I start parsing it? Right now I'm using a regular expression to throw away anything that isn't an alphanumeric character or the </> characters, but it isn't quite working.
Try
xmltext = re.sub(u"[^\x20-\x7f]+",u"",xmltext)
It will get rid of everything outside the 0x20-0x7F range.
You may start from \x01 instead, if you want to keep control characters like tabs and line breaks:
xmltext = re.sub(u"[^\x01-\x7f]+",u"",xmltext)
Take a look at µTidyLib, a Python wrapper to TidyLib.
If you do need the data with the strange characters, you could, instead of just stripping them, convert them to codes the XML parser can understand.
You could have a look at the unicodedata module, especially its normalize method.
I haven't used it myself, so I can't tell you all that much, but you could ask again here on SO if you decide you're going to convert and keep that data.
>>> import unicodedata
>>> unicodedata.normalize("NFKD", u"ไอเฟล &")
u'\u0e44\u0e2d\u0e40\u0e1f\u0e25 &'
It looks like you're dealing with data that was saved with some kind of encoding "as if" it were ASCII. An XML file should normally be UTF-8, and SAX (the underlying parser used by minidom) should handle that, so it looks like something's wrong in that part of the processing chain. Instead of focusing on "cleaning up", I'd first try to make sure the encoding is correct and correctly recognized. Maybe a broken XML directive? Can you edit your question to show the first few lines of the file, especially the <?xml ... directive at the very start?
I'd throw out all non-ASCII characters, which can be identified by having the 8th bit (0x80) set (128..255, i.e. 0x80..0xFF).
You could read the file into a Python string named old_str, then perform a filter call in conjunction with a lambda:
import string
new_str = filter(lambda x: x in string.printable, old_str)
(string.ascii_letters would also strip digits, whitespace and punctuation; string.printable keeps all printable ASCII.)
Then parse new_str.
Many ways exist to accomplish stripping non-ASCII characters from a string.
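One such way, assuming Python 3, is an encode/decode round trip that simply drops everything outside ASCII:

```python
# Drop every non-ASCII character via an encode/decode round trip
text = 'caf\u00e9 \u0e44\u0e2d\u0e40\u0e1f\u0e25 &'   # 'café ไอเฟล &'
ascii_only = text.encode('ascii', errors='ignore').decode('ascii')
print(repr(ascii_only))  # 'caf  &'
```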
This question might be related: How to check if a string in Python is in ASCII?
