Parsing url string from '+' to '%2B'

I have a URL whose extension needs to be percent-encoded (ASCII/UTF-8):
a='sAE3DSRAfv+HG='
I need to convert it to this:
a='sAE3DSRAfv%2BHG%3D'
I searched but could not find how to do it.

See the built-in function urllib.parse.quote().
A URL must be transmitted safely: its meaning must not change between the point where it is created and the point where the intended receiver reads it. URL encoding (percent-encoding) exists for that purpose. See RFC 3986, which obsoletes RFC 2396.
A URL might contain non-ASCII characters, as in cafés or López, or symbols that carry a special meaning inside a URL, such as #, which introduces a fragment (a bookmark within the page). To transmit such characters safely, the standards require that you quote the URL at the point of origin, so that everyone else always sees it in quoted form.
Sample usage:
>>> import urllib.parse
>>> a='sAE3DSRAfv+HG='
>>> urllib.parse.quote(a)
'sAE3DSRAfv%2BHG%3D'
>>>
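As a small follow-up sketch (the variable names are illustrative only), the round trip through unquote() and the effect of the safe parameter can be checked like this. The default value of safe is "/", which is why a slash survives unless you pass safe="".

```python
from urllib.parse import quote, unquote

token = "sAE3DSRAfv+HG="

encoded = quote(token)            # '+' becomes %2B, '=' becomes %3D
print(encoded)                    # sAE3DSRAfv%2BHG%3D
print(unquote(encoded) == token)  # True: decoding restores the original

# Characters listed in `safe` are left untouched; the default is "/".
print(quote(token, safe="="))     # sAE3DSRAfv%2BHG=
print(quote("a/b"))               # a/b
print(quote("a/b", safe=""))      # a%2Fb
```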

Related

How to change the way my requests.get() function is sent?

I am trying to create an http request to get some json data from a site online. When I set up the requests.get() function, it seems to be translating some of the special characters in the parameters to other values, causing the response to fail. Is there a way to control how the .get() is sent?
I'm trying to send this http request:
'https://registers.esma.europa.eu/solr/esma_registers_firds_files/select?q=*&fq=publication_date:%5B2020-05-10T00:00:00Z+TO+2020-05-10T23:59:59Z%5D&wt=json&indent=true&start=0&rows=100'
To do so, here is my code:
response = requests.get('https://registers.esma.europa.eu/solr/esma_registers_firds_files/select',
params={'q':'*',
'fq':'publication_date:%5B2020-05-10T00:00:00Z+TO+2020-05-10T23:59:59Z%5D',
'wt':'json',
'indent': 'true',
'start':0,
'rows':100},)
However, this code seems to translate the '*' character and the ':' character into a different format, which means I'm getting a bad response code. Here is what response.url shows after the call:
response.url
https://registers.esma.europa.eu/solr/esma_registers_firds_files/select?q=%2A&fq=publication_date%3A%255B2020-05-10T00%3A00%3A00Z%2BTO%2B2020-05-10T23%3A59%3A59Z%255D&wt=json&indent=true&start=0&rows=100
You can see that the '*' in the 'q' param became '%2A', and the ':' in the 'fq' param became '%3A', etc.
I know the link works, because if I enter it directly into the requests.get(), I get the results I expect.
Is there a way to make it so that the special characters in the .get() don't change? I've been googling anything related to the requests module and character encoding, but haven't had any luck. I could just use the whole url each time I need it, but I think that using params is better practice. Any help would be much appreciated. Thanks!

That's not actually the problem. The conversion you're seeing is supposed to happen. It's called URL encoding.
The problem is in the publication_date value. See the %5B and %5D and the + signs?
'fq':'publication_date:%5B2020-05-10T00:00:00Z+TO+2020-05-10T23:59:59Z%5D'
                       ^^^                    ^  ^                    ^^^
I don't know where you got this string, but this string has already gone through URL encoding. The %5B, %5D, and + are encoded forms of [, ], and space. You need to provide unencoded values, like this:
'fq':'publication_date:[2020-05-10T00:00:00Z TO 2020-05-10T23:59:59Z]'
requests will handle the encoding.
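The double encoding can be reproduced without any network access. The sketch below uses the standard library's urllib.parse.urlencode as a stand-in for what requests does with params. That substitution is an assumption made for illustration, and the exact string requests produces may differ in detail.

```python
from urllib.parse import urlencode, parse_qs

raw = "publication_date:[2020-05-10T00:00:00Z TO 2020-05-10T23:59:59Z]"
pre_encoded = "publication_date:%5B2020-05-10T00:00:00Z+TO+2020-05-10T23:59:59Z%5D"

once = urlencode({"fq": raw})           # encoded exactly once
twice = urlencode({"fq": pre_encoded})  # '%' and '+' get encoded again

print(once)   # contains %5B ... +TO+ ... %5D
print(twice)  # contains %255B ... %2BTO%2B ... %255D

# What a server would decode from each query string:
print(parse_qs(once)["fq"][0] == raw)           # True: the intended value
print(parse_qs(twice)["fq"][0] == pre_encoded)  # True: still percent-encoded
```

The second query string decodes to text that still contains %5B and +, which matches the %255B and %2B sequences visible in the failing response.url above.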

URL Encoding yields two different results? Only one works

I'm writing a Python script to fetch Korean vocabulary pronunciation. I have a URL ready to go, and when I open the URL in Safari, it retrieves the expected JSON from the server.
When I use requests to get the JSON, the call fails and no results are found.
Using Charles, I can see that the URL with my original query, a Hangul word, is URL encoded after I paste the URL into Safari and hit enter. For example, the instance of 소식 in the URL string becomes %EC%86%8C%EC%8B%9D on its way out.
However, when I make that same request with requests, the word is encoded as %E1%84%89%E1%85%A9%E1%84%89%E1%85%B5%E1%86%A8. Both encodings can be decoded back to the original word 소식 (using a web app to confirm). The former encoding is accepted by the server, the latter is not.
Why would I be getting a different encoding from requests?
Edit
Query string comes into the script as 소식
query = sys.argv[1]
sys.stderr.write(query) -> 소식
Interpolating the query into the URL string yields ...json/word/소식... when printing it.
Going through Charles it now looks like this /json/word/%E1%84%89%E1%85%A9%E1%84%89%E1%85%B5%E1%86%A8/. Everything is default, no specified encoding.

These are both valid url-encodings of the "same" input text:
>>> from urllib.parse import unquote
>>> ulong = unquote('%E1%84%89%E1%85%A9%E1%84%89%E1%85%B5%E1%86%A8')
>>> ushort = unquote('%EC%86%8C%EC%8B%9D')
>>> ulong
'소식'
>>> ushort
'소식'
The strings are not actually equal, though, they have different forms in unicode:
>>> from unicodedata import name
>>> [name(x) for x in ulong]
['HANGUL CHOSEONG SIOS',
'HANGUL JUNGSEONG O',
'HANGUL CHOSEONG SIOS',
'HANGUL JUNGSEONG I',
'HANGUL JONGSEONG KIYEOK']
>>> [name(x) for x in ushort]
['HANGUL SYLLABLE SO', 'HANGUL SYLLABLE SIG']
I do not know any Korean, but it looks like the long string is composed of combining characters (you can also see similar things with latin characters and accents). If I perform a canonical decomposition and composition of the forms, I get equality:
>>> from unicodedata import normalize
>>> normalize('NFC', ulong) == ushort
True
So, either you are using different input texts, that just happen to look the same (even repr is not enough to see the difference, you have to examine the codepoints) or one of the methods you are using - probably the browser - is performing a normalization/transformation.
Since the short form of the text is what worked with the server, I suggest you normalize the inputs to your script into the NFC form.
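A minimal sketch of that suggestion follows. The helper name quote_nfc is hypothetical, and the decomposed input is rebuilt from the percent-encoded form quoted in the question so that the source file itself does not need to contain decomposed Hangul.

```python
import unicodedata
from urllib.parse import quote, unquote


def quote_nfc(text):
    """Hypothetical helper: normalize to NFC, then percent-encode as UTF-8."""
    return quote(unicodedata.normalize("NFC", text))


# The decomposed (jamo) form that requests was sending.
decomposed = unquote("%E1%84%89%E1%85%A9%E1%84%89%E1%85%B5%E1%86%A8")

print(len(decomposed))        # 5 code points
print(quote(decomposed))      # the long encoding the server rejected
print(quote_nfc(decomposed))  # %EC%86%8C%EC%8B%9D
```

In the script from the question, the same idea would amount to normalizing sys.argv[1] with unicodedata.normalize('NFC', ...) before interpolating it into the URL.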

Problems writing a regex in testcases.xml of pylot

I have to verify that a list of strings is present in the response to a SOAP request. I am using the Pylot testing tool. I know that a literal string inside the <verify>abcd</verify> element works fine. I have to use a regex, though, and I am having trouble with it because I am not good with regexes.
I have to verify if <TestName>Abcd Hijk</TestName> is present in my response for the request sent.
Following is my attempt to write the regex inside testcases.xml
<verify>[.TestName.][\w][./TestName.]</verify>
Is this the correct way to write a regex in testcases.xml file? I want to exactly verify the tagnames and its values mentioned above.
When I run the tool, it gives me no errors. But if I change the characters to <verify>[.TesttttName.][\w][./TestttttName.]</verify> and run the tool, it still runs without errors, although this should be a failed run because no such tags are present in the response.
Could someone please tell me what I am doing wrong in the regex here?
Any help would be appreciated. Thanks!

The regex used should be like the following.
<verify>&lt;TestName&gt;[\w\s]+&lt;/TestName&gt;</verify>
The reason is that Pylot holds the response content as plain text, so the relevant part of the response looks like this:
.......<TestName>ABCd Hijk</TestName>.....
When Pylot parses the <verify> element in testcases.xml, it takes the value of the element as text. It then searches for that 'verify text' in the response it received for the request.
Hence, whenever you want to verify anything in Pylot with a regex, write the regex in the above format so that it gives the required results.
Note: Be careful about the format of the response you receive. To view the response, enable Log Messages in the tool, or, if you want to see it on the console, edit the tool's engine.py module and insert print statements.

The raw regular expression (no XML escape). I assume you want to accept English alphabet a-zA-Z, digits 0-9, underscore _ and space characters (space, new line, carriage return, and a few others - check documentation for details).
<TestName>[\w\s]+</TestName>
You need to escape the < and > as &lt; and &gt; when placing it inside the <verify> tag:
&lt;TestName&gt;[\w\s]+&lt;/TestName&gt;
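The difference between the two patterns can be checked outside Pylot with Python's re module. This is only a sketch: the sample responses are made up, and it assumes Pylot applies the pattern as a search over the response text, as described above. It also shows why the original attempt never failed: each bracketed group is a character class that matches a single character, so the pattern matches almost any three-character run.

```python
import re
from xml.sax.saxutils import escape

# Made-up response bodies for illustration.
response = "<Result><TestName>Abcd Hijk</TestName></Result>"
other = "<Result><Other>Abcd Hijk</Other></Result>"

attempt = r"[.TestName.][\w][./TestName.]"  # three single-character classes
fixed = r"<TestName>[\w\s]+</TestName>"     # literal tags around the value

print(bool(re.search(attempt, other)))   # True: matches "the" inside "Other"
print(bool(re.search(fixed, response)))  # True
print(bool(re.search(fixed, other)))     # False

# XML-escaped form to paste inside <verify>...</verify>
print(escape(fixed))
```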

Python: How to check if a string is a valid IRI?

Is there a standard function to check whether a string is a valid IRI? To check a URL, apparently I can use:
parts = urlparse.urlsplit(url)
if not parts.scheme or not parts.netloc:
    '''apparently not an url'''
I tried the above with an URL containing Unicode characters:
import urlparse
url = "http://fdasdf.fdsfîășîs.fss/ăîăî"
parts = urlparse.urlsplit(url)
if not parts.scheme or not parts.netloc:
    print "not an url"
else:
    print "yes an url"
and what I get is yes an url. Does this mean I'm good and this tests for a valid IRI? Is there another way?

Using urlparse is not sufficient to test for a valid IRI.
Use the rfc3987 package instead:
from rfc3987 import parse
parse('http://fdasdf.fdsfîășîs.fss/ăîăî', rule='IRI')

The only character-set-sensitive code in the implementation of urlparse is requiring that the scheme should contain only ASCII letters, digits and [+-.] characters; otherwise it's completely agnostic so will work fine with non-ASCII characters.
As this is non-documented behaviour, it's your responsibility to check that it continues to be the case (with tests in your project), but I don't imagine it would be changed to break IRIs.
urllib provides quoting functions to convert IRIs to/from ASCII URIs, although they still don't mention IRIs explicitly in the documentation, and they are broken in some cases: Is there a unicode-ready substitute I can use for urllib.quote and urllib.unquote in Python 2.6.5?
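For the conversion direction mentioned above, here is a best-effort Python 3 sketch using only the standard library. The helper name iri_to_uri is hypothetical. It converts rather than validates, it drops any userinfo, it does not handle IPv6 literals, and it relies on Python's built-in idna codec, which implements the older IDNA 2003 rules.

```python
from urllib.parse import quote, urlsplit, urlunsplit


def iri_to_uri(iri):
    """Hypothetical helper: best-effort IRI to ASCII URI conversion."""
    parts = urlsplit(iri)
    # IDNA-encode the host name (userinfo, if any, is dropped here).
    host = parts.hostname.encode("idna").decode("ascii") if parts.hostname else ""
    netloc = host if parts.port is None else f"{host}:{parts.port}"
    # Percent-encode non-ASCII text as UTF-8, keeping delimiters and
    # any percent escapes that are already present.
    path = quote(parts.path, safe="/%")
    query = quote(parts.query, safe="=&%+")
    fragment = quote(parts.fragment, safe="%")
    return urlunsplit((parts.scheme, netloc, path, query, fragment))


print(iri_to_uri("http://b\u00fccher.example/\u00e4"))
# http://xn--bcher-kva.example/%C3%A4
```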

How can I normalize a URL in python

I'd like to know how to normalize a URL in Python.
For example, if I have a URL string like: "http://www.example.com/foo goo/bar.html"
I need a Python library that will transform the extra space (or any other non-normalized character) into a proper URL.

Have a look at this module: werkzeug.utils. (now in werkzeug.urls)
The function you are looking for is called "url_fix" and works like this:
>>> from werkzeug.urls import url_fix
>>> url_fix(u'http://de.wikipedia.org/wiki/Elf (Begriffsklärung)')
'http://de.wikipedia.org/wiki/Elf%20%28Begriffskl%C3%A4rung%29'
It's implemented in Werkzeug as follows:
import urllib
import urlparse

def url_fix(s, charset='utf-8'):
    """Sometimes you get an URL by a user that just isn't a real
    URL because it contains unsafe characters like ' ' and so on. This
    function can fix some of the problems in a similar way browsers
    handle data entered by the user:

    >>> url_fix(u'http://de.wikipedia.org/wiki/Elf (Begriffsklärung)')
    'http://de.wikipedia.org/wiki/Elf%20%28Begriffskl%C3%A4rung%29'

    :param charset: The target charset for the URL if the url was
                    given as unicode string.
    """
    if isinstance(s, unicode):
        s = s.encode(charset, 'ignore')
    scheme, netloc, path, qs, anchor = urlparse.urlsplit(s)
    path = urllib.quote(path, '/%')
    qs = urllib.quote_plus(qs, ':&=')
    return urlparse.urlunsplit((scheme, netloc, path, qs, anchor))
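The listing above is Python 2 code. A rough Python 3 port might look like the sketch below. The name url_fix_py3 is hypothetical and this is not Werkzeug's current code. It assumes the input is already a str, so the unicode/charset branch is dropped.

```python
from urllib.parse import quote, quote_plus, urlsplit, urlunsplit


def url_fix_py3(url):
    """Hypothetical Python 3 port of the url_fix listing above."""
    scheme, netloc, path, qs, anchor = urlsplit(url)
    path = quote(path, safe="/%")     # keep slashes and existing escapes
    qs = quote_plus(qs, safe=":&=")   # keep query delimiters
    return urlunsplit((scheme, netloc, path, qs, anchor))


print(url_fix_py3("http://de.wikipedia.org/wiki/Elf (Begriffskl\u00e4rung)"))
# http://de.wikipedia.org/wiki/Elf%20%28Begriffskl%C3%A4rung%29
print(url_fix_py3("http://www.example.com/foo goo/bar.html"))
# http://www.example.com/foo%20goo/bar.html
```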

Real fix in Python 2.7 for that problem
Right solution was:
# percent encode url, fixing lame server errors for e.g, like space
# within url paths.
fullurl = quote(fullurl, safe="%/:=&?~#+!$,;'#()*[]")
For more information see Issue918368: "urllib doesn't correct server returned urls"

use urllib.quote or urllib.quote_plus
From the urllib documentation:
quote(string[, safe])
Replace special characters in string
using the "%xx" escape. Letters,
digits, and the characters "_.-" are
never quoted. The optional safe
parameter specifies additional
characters that should not be quoted
-- its default value is '/'.
Example: quote('/~connolly/') yields '/%7econnolly/'.
quote_plus(string[, safe])
Like quote(), but also replaces spaces
by plus signs, as required for quoting
HTML form values. Plus signs in the
original string are escaped unless
they are included in safe. It also
does not have safe default to '/'.
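In Python 3 the same two functions live in urllib.parse. The short sketch below contrasts them on a space, a slash, and a plus sign. Note that since Python 3.7 the tilde is also treated as unreserved, so the '/~connolly/' example from the old documentation is no longer escaped there.

```python
from urllib.parse import quote, quote_plus

print(quote("foo goo/bar.html"))       # foo%20goo/bar.html
print(quote_plus("foo goo/bar.html"))  # foo+goo%2Fbar.html

# A literal plus sign is escaped by quote_plus unless listed in `safe`.
print(quote_plus("a+b c"))             # a%2Bb+c
print(quote_plus("a+b c", safe="+"))   # a+b+c
```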
EDIT: Using urllib.quote or urllib.quote_plus on the whole URL will mangle it, as #ΤΖΩΤΖΙΟΥ points out:
>>> quoted_url = urllib.quote('http://www.example.com/foo goo/bar.html')
>>> quoted_url
'http%3A//www.example.com/foo%20goo/bar.html'
>>> urllib2.urlopen(quoted_url)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "c:\python25\lib\urllib2.py", line 124, in urlopen
    return _opener.open(url, data)
  File "c:\python25\lib\urllib2.py", line 373, in open
    protocol = req.get_type()
  File "c:\python25\lib\urllib2.py", line 244, in get_type
    raise ValueError, "unknown url type: %s" % self.__original
ValueError: unknown url type: http%3A//www.example.com/foo%20goo/bar.html
#ΤΖΩΤΖΙΟΥ provides a function that uses urlparse.urlparse and urlparse.urlunparse to parse the url and only encode the path. This may be more useful for you, although if you're building the URL from a known protocol and host but with a suspect path, you could probably do just as well to avoid urlparse and just quote the suspect part of the URL, concatenating with known safe parts.

Because this page is a top result for Google searches on the topic, I think it's worth mentioning some work that has been done on URL normalization with Python that goes beyond urlencoding space characters. For example, dealing with default ports, character case, lack of trailing slashes, etc.
When the Atom syndication format was being developed, there was some discussion on how to normalize URLs into canonical format; this is documented in the article PaceCanonicalIds on the Atom/Pie wiki. That article provides some good test cases.
I believe that one result of this discussion was Mark Nottingham's urlnorm.py library, which I've used with good results on a couple projects. That script doesn't work with the URL given in this question, however. So a better choice might be Sam Ruby's version of urlnorm.py, which handles that URL, and all of the aforementioned test cases from the Atom wiki.
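To give a flavor of the normalizations mentioned above (character case, default ports, missing trailing slash), here is a small standard-library sketch. It is not urlnorm.py. The function name and the port table are assumptions for illustration, and it drops userinfo and does not handle IPv6 literals.

```python
from urllib.parse import urlsplit, urlunsplit

# Assumed default ports for the two schemes handled by this sketch.
DEFAULT_PORTS = {"http": 80, "https": 443}


def normalize_url(url):
    """Hypothetical helper: lowercase scheme/host, drop default port, add '/'."""
    parts = urlsplit(url)
    scheme = parts.scheme.lower()
    host = (parts.hostname or "").lower()
    port = parts.port
    if port is None or DEFAULT_PORTS.get(scheme) == port:
        netloc = host
    else:
        netloc = f"{host}:{port}"
    path = parts.path or "/"  # an empty path is equivalent to "/"
    return urlunsplit((scheme, netloc, path, parts.query, parts.fragment))


print(normalize_url("HTTP://www.Example.com:80"))       # http://www.example.com/
print(normalize_url("https://Example.com:8443/a?b=1"))  # https://example.com:8443/a?b=1
```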

Py3
from urllib.parse import urlparse, urlunparse, quote
def myquote(url):
    parts = urlparse(url)
    return urlunparse(parts._replace(path=quote(parts.path)))
>>> myquote('https://www.example.com/~user/with space/index.html?a=1&b=2')
'https://www.example.com/~user/with%20space/index.html?a=1&b=2'
Py2
import urlparse, urllib
def myquote(url):
    parts = urlparse.urlparse(url)
    return urlparse.urlunparse(parts[:2] + (urllib.quote(parts[2]),) + parts[3:])
>>> myquote('https://www.example.com/~user/with space/index.html?a=1&b=2')
'https://www.example.com/%7Euser/with%20space/index.html?a=1&b=2'
This quotes only the path component.

Just FYI, urlnorm has moved to github:
http://gist.github.com/246089

Valid for Python 3.5:
import urllib.parse
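urllib.parse.quote([your_url], safe="/:")
The characters ".", "_" and "-" are never quoted, so only "/" and ":" need to be listed as safe. The backslash in the earlier "\./_-:" form was an invalid string escape and was not needed.
Example:
import urllib.parse
print(urllib.parse.quote("http://www.example.com/foo goo/bar.html", safe="/:"))
The output will be http://www.example.com/foo%20goo/bar.html
Source: https://docs.python.org/3.5/library/urllib.parse.html?highlight=quote#urllib.parse.quote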

I ran into the same problem: I needed to quote only the space.
fullurl = quote(fullurl, safe="%/:=&?~#+!$,;'#()*[]") does help, but it's too complicated.
So I used a simple approach: url = url.replace(' ', '%20'). It's not perfect, but it's the simplest way and it works for this situation.

A lot of answers here talk about quoting URLs, not about normalizing them.
The best tool to normalize urls (for deduplication etc.) in Python IMO is w3lib's w3lib.url.canonicalize_url util.
Taken from the official docs:
Canonicalize the given url by applying the following procedures:
- sort query arguments, first by key, then by value
- percent encode paths ; non-ASCII characters are percent-encoded using UTF-8 (RFC-3986)
- percent encode query arguments ; non-ASCII characters are percent-encoded using passed encoding (UTF-8 by default)
- normalize all spaces (in query arguments) to ‘+’ (plus symbol)
- normalize percent encodings case (%2f -> %2F)
- remove query arguments with blank values (unless keep_blank_values is True)
- remove fragments (unless keep_fragments is True)
The url passed can be bytes or unicode, while the url returned is always a native str (bytes in Python 2, unicode in Python 3).
>>> import w3lib.url
>>>
>>> # sorting query arguments
>>> w3lib.url.canonicalize_url('http://www.example.com/do?c=3&b=5&b=2&a=50')
'http://www.example.com/do?a=50&b=2&b=5&c=3'
>>>
>>> # UTF-8 conversion + percent-encoding of non-ASCII characters
>>> w3lib.url.canonicalize_url('http://www.example.com/r\u00e9sum\u00e9')
'http://www.example.com/r%C3%A9sum%C3%A9'
I've used this util with great success when broad crawling the web to avoid duplicate requests because of minor url differences (different parameter order, anchors etc)
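If adding w3lib is not an option, the query-sorting and fragment-dropping steps can be approximated with the standard library. The sketch below is not w3lib's implementation and covers only those two procedures. The function name is hypothetical.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def sort_query(url):
    """Hypothetical helper: sort query arguments and drop the fragment."""
    parts = urlsplit(url)
    # Sort by key, then by value; keep arguments that have blank values.
    pairs = sorted(parse_qsl(parts.query, keep_blank_values=True))
    return urlunsplit(parts._replace(query=urlencode(pairs), fragment=""))


print(sort_query("http://www.example.com/do?c=3&b=5&b=2&a=50#frag"))
# http://www.example.com/do?a=50&b=2&b=5&c=3
```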
