Python: Get data from URL query string

So I have a string containing a URL.
The URL/string is something like this:
https://example.com/main/?code=32ll48hma6ldfm01bpki&data=57600&data2=aardappels
I want to get the code parameter, but I couldn't figure out how. I looked at the .split() method, but I don't think it is efficient, and I couldn't really find a way to get it working.

Use urlparse and parse_qs from the urlparse module:
from urlparse import urlparse, parse_qs
# For Python 3:
# from urllib.parse import urlparse, parse_qs
url = 'https://example.com/main'
url += '/?code=32ll48hma6ldfm01bpki&data=57600&data2=aardappels'
parsed = urlparse(url)
code = parse_qs(parsed.query).get('code')[0]
It does exactly what you want.
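Note that parse_qs returns a dict mapping each parameter name to a list of values, which is why the [0] index is needed. A small sketch, assuming you also want to guard against the parameter being absent (the None fallback is just an illustrative choice):
from urllib.parse import urlparse, parse_qs  # Python 2: from urlparse import urlparse, parse_qs
url = 'https://example.com/main/?code=32ll48hma6ldfm01bpki&data=57600&data2=aardappels'
params = parse_qs(urlparse(url).query)
# Using a list default avoids a TypeError when 'code' is missing
code = params.get('code', [None])[0]
print(code)  # 32ll48hma6ldfm01bpki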

As @IronFist mentions, the .split() method works only if you assume there is no '&' in the code parameter. If that assumption holds, you can call .split() a couple of times and get the desired code parameter:
url = "https://example.com/main/?code=32ll48hma6ldfm01bpki&data=57600&data2=aardappels"
code = url.split('/?')[1].split('&')[0]

There are many ways of doing this. The easiest is to use urlparse; another is to use a regular expression, but regular expressions on URLs tend to be tedious and the code becomes difficult to maintain.
Another simple way is shown below:
str1 = 'https://example.com/main/?code=32ll48hma6ldfm01bpki&data=57600&data2=aardappels'
codeStart = str1.find('code=')      # index where 'code=' begins
codeEnd = str1.find('&data=')       # index where the next parameter begins
print(str1[codeStart + 5:codeEnd])  # skip the 5 characters of 'code='

Using regular expressions:
>>> import re
>>> url = 'https://example.com/main/?code=32ll48hma6ldfm01bpki&data=57600&data2=aardappels'
>>> code = re.search("code=([0-9a-zA-Z]+)&?", url).group(1)
>>> print code
32ll48hma6ldfm01bpki

Related

How to extract more than one pattern from a string using Python Regular Expressions?

https://epolicy.companyname.co.in/PRODUCTNAME/UI/PremiumCalculation.aspx?utm_source=rtb&utm_medium=display&utm_campaign=dbmew-Category-pros&dclid=CO2g3u7Gy98CFUOgaAodUv4E0w
I have millions of such URLs and I want to extract two things from this.
PRODUCTNAME: always preceded by https://epolicy.companyname.co.in
*.aspx: Page accessed
I tried the following regular expression
re.findall('([a-zA-Z]+\.aspx | https://epolicy\.companyname\.co\.in/(.*?)/UI)', URL)
and a few variants of it. But it didn't work. What is the correct way to do this?
Try this!
Code :
import re
url = "https://epolicy.companyname.co.in/PRODUCTNAME/UI/PremiumCalculation.aspx?utm_source=rtb&utm_medium=display&utm_campaign=dbmew-Category-pros&dclid=CO2g3u7Gy98CFUOgaAodUv4E0w"
print(re.findall(r'https://[^/]*/(.*)/UI/(.*)\.aspx', url))
Output :
[('PRODUCTNAME', 'PremiumCalculation')]
Regex doesn't seem to be the right thing to use here at all. Rather, parse the URL, split the path, and get the first and last elements.
from urllib.parse import urlparse
from pathlib import PurePath
components = urlparse(url)
path = PurePath(components.path)
product_name = path.parts[1]  # 'PRODUCTNAME'
page = path.stem              # 'PremiumCalculation'
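With the example URL from the question assigned to url, a quick check of the values above:
print(product_name)  # PRODUCTNAME
print(page)          # PremiumCalculation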

How to strip random Chars at the end of a String with Regex / Strip() in Python?

What is the preferred way to cut off random characters at the end of a string in Python?
I am trying to simplify a list of URLs to do some analysis and therefore need to cut off everything that comes after the file extension .php.
Since the characters that follow .php are different for each URL, strip() doesn't work. I thought about regex and substrings, but what would be the most efficient way to solve this task?
Example:
Let's say I have the following URLs:
example.com/index.php?random_var=random-19wdwka
example.org/index.php?another_var=random-2js9m2msl
And I want the output to be:
example.com/index.php
example.org/index.php
Thanks for your advice!
There are two ways to accomplish what you want.
If you know how the string ends:
In your example, if you know that the part you want always ends right before the '?', then all you need to do is:
my_string.split('?')[0]
If you don't know how the string ends:
In this case you can use urlparse and take everything but the parameters.
from urlparse import urlparse  # urllib.parse in Python 3
for url in urls:
    p = urlparse(url)
    print(p.scheme + '://' + p.netloc + p.path)
for url in urls:
    result = url.split('?')[0]
    print(result)
Split on your separator at most once, and take the first piece:
text = "example.com/index.php?random_var=random-19wdwka"
sep = ".php"
rest = text.split(sep, 1)[0] + sep  # split at most once, re-append the extension
print(rest)                         # example.com/index.php
It seems like what you really want is to strip away the parameters of the URL; you can also use
from urlparse import urlparse, urlunparse
urlunparse(urlparse(url)[:3] + ('', '', ''))
to replace the params, query and fragment parts of the URL with empty strings and generate a new one.
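A quick demonstration on the example URLs from the question (they have no scheme, so everything up to the ? ends up in the path component, which is fine here):
from urlparse import urlparse, urlunparse  # urllib.parse in Python 3
for url in ['example.com/index.php?random_var=random-19wdwka',
            'example.org/index.php?another_var=random-2js9m2msl']:
    print(urlunparse(urlparse(url)[:3] + ('', '', '')))
# example.com/index.php
# example.org/index.php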

Python regex group extraction

For this string:
"https://webster.bfm.com/viewserver/rw?port=AAA-NY&rpttag=praada_pnl_sum_eq.BMACS_ASST_ALL&subdir=gp_views/MUS-ALLRET/released"
I'm looking to run a regular expression so the result looks like the list below:
list = [port=AAA,rpttag=praada_pnl_sum_eq.BMACS_ASST_ALL,subdir=gp_views/MUS-ALLRET/released]
I got this so far:
re.findall(r'\?(.+)','https://webster.bfm.com/viewserver/rw?port=AAA-NY&rpttag=praada_pnl_sum_eq.BMACS_ASST_ALL&subdir=gp_views/MUS-ALLRET/released')
That just returns one string. I know I need to repeat a pattern like \S&+, but I can't seem to figure out the best way to do this all in one regex.
re.findall(r'[^?&]+', s)[1:]
This works by matching runs of characters that are neither ? nor &, and then throwing away the first match, which is the part up to the ?.
I'm making two assumptions here: first, that there are no ? characters in your fragments, and second, that you really want the first element of your list to be port=AAA-NY.
Using regex to parse URL is a bad idea when Python has built-in library to do the job:
Python 3
Use urlparse to parse the URL into scheme, netloc, path, query, etc., then use parse_qs to parse the query string.
Do check out the documentation for parsing options for corner cases.
Example code:
from urllib.parse import urlparse, parse_qs
url_string = 'https://webster.bfm.com/viewserver/rw?port=AAA-NY&rpttag=praada_pnl_sum_eq.BMACS_ASST_ALL&subdir=gp_views/MUS-ALLRET/released'
url = urlparse(url_string)
query_parts = parse_qs(url.query)
Printing query_parts:
>>> print(query_parts)
{'rpttag': ['praada_pnl_sum_eq.BMACS_ASST_ALL'], 'port': ['AAA-NY'], 'subdir': ['gp_views/MUS-ALLRET/released']}
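Each value is a list because a query parameter may appear more than once, so an individual value is taken by indexing:
>>> query_parts['port'][0]
'AAA-NY'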
Python 2
The code in Python 2 is similar, but you need to import the urlparse module instead of urllib.parse. The functions are more or less the same.
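A minimal sketch of the same thing in Python 2, where only the import changes:
from urlparse import urlparse, parse_qs
url = urlparse('https://webster.bfm.com/viewserver/rw?port=AAA-NY&rpttag=praada_pnl_sum_eq.BMACS_ASST_ALL&subdir=gp_views/MUS-ALLRET/released')
query_parts = parse_qs(url.query)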

python regex urls

I have a bunch of (ugly if I may say) urls, which I would like to clean up using python regex. So, my urls look something like:
http://www.thisislink1.com/this/is/sublink1/1
http://www.thisislink2.co.uk/this/is/sublink1s/klinks
http://www.thisislinkd.co/this/is/sublink1/hotlinks/2
http://www.thisislinkf.com.uk/this/is/sublink1d/morelink
http://www.thisislink1.co.in/this/is/sublink1c/mylink
....
What I'd like to do is clean up these urls, so that the final link looks like:
http://www.thisislink1.com
http://www.thisislink2.co.uk
http://www.thisislinkd.co
http://www.thisislinkf.de
http://www.thisislink1.us
....
and I was wondering how I can achieve this in a pythonic way. Sorry if this is a 101 question - I am new to Python regex structures.
Use urlparse.urlsplit:
In [3]: import urlparse
In [8]: url = urlparse.urlsplit('http://www.thisislink1.com/this/is/sublink1/1')
In [9]: url.netloc
Out[9]: 'www.thisislink1.com'
In Python 3 it would be
import urllib.parse as parse
url = parse.urlsplit('http://www.thisislink1.com/this/is/sublink1/1')
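Since the desired output keeps the scheme as well, a small sketch of rebuilding the cleaned link from the split result:
import urllib.parse as parse
url = parse.urlsplit('http://www.thisislink1.com/this/is/sublink1/1')
cleaned = parse.urlunsplit((url.scheme, url.netloc, '', '', ''))
print(cleaned)  # http://www.thisislink1.com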
Why use regex?
>>> import urlparse
>>> url = 'http://www.thisislinkd.co/this/is/sublink1/hotlinks/2'
>>> urlparse.urlsplit(url)
SplitResult(scheme='http', netloc='www.thisislinkd.co', path='/this/is/sublink1/hotlinks/2', query='', fragment='')
You should use a URL parser like others have suggested but for completeness here is a solution with regex:
import re
url='http://www.thisislink1.com/this/is/sublink1/1'
result = re.sub(r'(?<![/:])/.*', '', url)
# result == 'http://www.thisislink1.com'
Explanation:
Match everything after and including the first forwardslash that is not preceded by a : or / and replace it with nothing ''.
(?<![/:]) # Negative lookbehind for '/' or ':'
/.* # Match a / followed by anything
Maybe use something like this, applied to the whole text of URLs (one per line):
result = re.sub(r"(?m)^(https?://[^/]+)/.*$", r"\1", subject)

Python Find Question

I am using Python to extract the filename from a link using rfind like below:
url = "http://www.google.com/test.php"
print url[url.rfind("/") +1 : ]
This works OK with links without a / at the end of them and returns "test.php". I have encountered links with / at the end, like "http://www.google.com/test.php/". I am having trouble getting the page name when there is a "/" at the end, can anyone help?
Cheers
Just removing the slash at the end won't work, as you can probably have a URL that looks like this:
http://www.google.com/test.php?filepath=tests/hey.xml
...in which case you'll get back "hey.xml". Instead of manually checking for this, you can use urlparse to get rid of the parameters, then do the check other people suggested:
from urlparse import urlparse
url = "http://www.google.com/test.php?something=heyharr/sir/a.txt"
f = urlparse(url)[2].rstrip("/")
print f[f.rfind("/")+1:]
Use [r]strip to remove trailing slashes:
url.rstrip('/').rsplit('/', 1)[-1]
If a wider range of URLs is possible, including ones with ?queries, #anchors or without a path, do it properly with urlparse:
path = urlparse.urlparse(url).path
return path.rstrip('/').rsplit('/', 1)[-1] or '(root path)'
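For reuse, the same idea can be wrapped in a small helper (the function name is just for illustration; on Python 3, import from urllib.parse instead):
from urlparse import urlparse
def page_name(url):
    # Drop the query/fragment, strip any trailing slash, take the last path segment
    path = urlparse(url).path
    return path.rstrip('/').rsplit('/', 1)[-1] or '(root path)'
print(page_name("http://www.google.com/test.php/"))           # test.php
print(page_name("http://www.google.com/test.php?a=b/c.txt"))  # test.php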
Filenames with a slash at the end are technically still path definitions and indicate that the index file is to be read. If you actually have one that ends in test.php/, I would consider that an error. In any case, you can strip the / from the end before running your code as follows:
url = url.rstrip('/')
There is a library called urlparse that will parse the URL for you, but it still doesn't remove the / at the end, so one of the above will be the best option.
Just for fun, you can use a Regexp:
import re
print re.search('/([^/]+)/?$', url).group(1)
You could use
stripped = url.rstrip("/")
print(stripped[stripped.rfind("/") + 1:])
filter(None, url.split('/'))[-1]
(But urlparse is probably more readable, even if more verbose.)
