Parsing URL with regex - python

I'm trying to put if/else logic inside a regular expression: if a certain pattern exists in the string, capture one thing; if not, capture another.
The string is:
'https://www.searchpage.com/searchcompany.aspx?companyId=41490234&page=0&leftlink=true' and I want to extract the stuff around the '?'.
So if '?' is detected inside the string, the regular expression should capture everything after the '?' mark; if not, then just capture from the beginning.
I used:'(.*\?.*)?(\?.*&.*)|(^&.*)'
But it didn't work...
Any suggestion?
Thanks!

Use urlparse:
>>> import urlparse
>>> parse_result = urlparse.urlparse('https://www.searchpage.com/searchcompany.aspx?companyId=41490234&page=0&leftlink=true')
>>> parse_result
ParseResult(scheme='https', netloc='www.searchpage.com', path='/searchcompany.aspx', params='', query='companyId=41490234&page=0&leftlink=true', fragment='')
>>> urlparse.parse_qs(parse_result.query)
{'leftlink': ['true'], 'page': ['0'], 'companyId': ['41490234']}
The last line is a dictionary of key/value pairs.
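The transcript above is Python 2. In Python 3 the same functions live in urllib.parse. A minimal sketch of the equivalent, reusing the URL from the question:

```python
from urllib.parse import urlparse, parse_qs

url = "https://www.searchpage.com/searchcompany.aspx?companyId=41490234&page=0&leftlink=true"

parsed = urlparse(url)
# parse_qs maps each key to a list, because a key may repeat in a query string
params = parse_qs(parsed.query)

print(parsed.query)         # companyId=41490234&page=0&leftlink=true
print(params["companyId"])  # ['41490234']
```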

Regex might not be the best solution to this problem. Why not just
my_url.split("?",1)
if that is truly all you wish to do
or as others have suggested
from urlparse import urlparse
print urlparse(my_url)
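The split approach also covers the "if there is no '?', take the whole string" case from the question. A small sketch, where after_question_mark is a hypothetical helper name and the second input is made up for illustration:

```python
def after_question_mark(text):
    # Split at the first "?" only. The last piece is everything after it,
    # or the whole string when there is no "?" at all.
    return text.split("?", 1)[-1]

with_query = "https://www.searchpage.com/searchcompany.aspx?companyId=41490234&page=0&leftlink=true"
without_query = "companyId=41490234&page=0"

print(after_question_mark(with_query))     # companyId=41490234&page=0&leftlink=true
print(after_question_mark(without_query))  # companyId=41490234&page=0
```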

This regex:
(^[^?]*$|(?<=\?).*)
captures:
^[^?]*$ everything, if there's no ?, or
(?<=\?).* everything after the ?, if there is one
However, you should look into urllib.parse (Python 3) or urlparse (Python 2) if you're working with URLs.
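As a quick check of the regex above, here is a sketch that runs it with re.search on the question's URL and on a made-up string without a '?':

```python
import re

# Either the whole string (when it contains no "?") or everything after the first "?"
pattern = re.compile(r"(^[^?]*$|(?<=\?).*)")

with_query = "https://www.searchpage.com/searchcompany.aspx?companyId=41490234&page=0&leftlink=true"
without_query = "companyId=41490234&page=0"

print(pattern.search(with_query).group(1))     # companyId=41490234&page=0&leftlink=true
print(pattern.search(without_query).group(1))  # companyId=41490234&page=0
```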

Related

how to regex this link?

I want to regex a list of URLs.
The links format looks like this:
'https://en.wikipedia.org/wiki/Alexander_Pushkin'
The part I need:
en.wikipedia.org
Can you help, please?
Instead of looking for \w etc. which would only match the domain, you're effectively looking for anything up to where the URL arguments start (the first ?):
re.search(r'[^?]*', URL)
This means: from the beginning of the string (search), all characters that are not ?. A character class beginning with ^ negates the class, i.e. not matching instead of matching.
This gives you a match object, where [0] will be the URL you're looking for.
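A short sketch of that call in use. The query string appended to the Wikipedia URL is an assumption added only so there is something to cut off:

```python
import re

# "?action=history" is a made-up query string for illustration
url = "https://en.wikipedia.org/wiki/Alexander_Pushkin?action=history"

match = re.search(r"[^?]*", url)
# The pattern can match zero characters, so search() always returns a match object here
print(match[0])  # https://en.wikipedia.org/wiki/Alexander_Pushkin
```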
You can do that without using regex by leveraging urllib.parse.urlparse
from urllib.parse import urlparse
url = "https://sales-office.ae/axcapital/damaclagoons/?cm_id=14981686043_130222322842_553881409427_kwd-1434230410787_m__g_&gclid=Cj0KCQiAxc6PBhCEARIsAH8Hff2k3IHDPpViVTzUfxx4NRD-fSsfWkCDT-ywLPY2C6OrdTP36x431QsaAt2dEALw_wcB"
parsed_url = urlparse(url)
print(f"{parsed_url.scheme}://{parsed_url.netloc}{parsed_url.path}")
Outputs
https://sales-office.ae/axcapital/damaclagoons/
Based on your example, this looks like it would work:
\w+://\S+\.\w+\/\S+\/
Based on: How to match "anything up until this sequence of characters" in a regular expression?
.+?(?=\?)
so:
re.findall(r".+?(?=\?)", URL)
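The question asked for just the host part (en.wikipedia.org), which urlparse exposes as netloc. A minimal sketch over a list of URLs; the second URL is a shortened, made-up variant of the one used in the answer above:

```python
from urllib.parse import urlparse

urls = [
    "https://en.wikipedia.org/wiki/Alexander_Pushkin",
    "https://sales-office.ae/axcapital/damaclagoons/?cm_id=1",  # shortened for illustration
]

# netloc is the "host[:port]" part between the scheme and the path
hosts = [urlparse(u).netloc for u in urls]
print(hosts)  # ['en.wikipedia.org', 'sales-office.ae']
```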

Is there any way to match a regex that starts with one string but *doesn't* start with another string?

So I'm trying to get more familiar with Python web scraping and I'm trying to find external links only for a specific function. In the book I'm reading, the author implements this by simply removing the "http://" from the string and then checking whether the link contains the resulting string (which is the domain name without the preceding "http://").
I can see how this code might fail, and although I can simply write an if statement, it does make me wonder: is there any way to match all links that start with "http" but not with "http(s)://domain.com"? I tried many different regex solutions that I thought would work, but they haven't.
For example, the variable "site" contains the link address.
re.compile("^((?!"+site+").)^http|www*$")
re.compile("^http|www((?!"+site+").)*$")
The results I get are simply all links that start with http or www, and that's not what I intend.
Again, I can implement this just fine with an if statement and filter the results, so this isn't a complete blocker, but I'm curious whether such a thing is possible.
Any help would be appreciated. I looked around the web but couldn't find anything that matches my use case.
I wouldn't recommend using regex for this task. Instead, I recommend urlparse from the urllib.parse module.
Here is an example:
$> from urllib.parse import urlparse
$> url = urlparse('https://google.com')
$> url
ParseResult(scheme='https', netloc='google.com', path='', params='', query='', fragment='')
$> url.scheme
'https'
$> url.netloc
'google.com'
$> urlparse('https://www.google.com')
ParseResult(scheme='https', netloc='www.google.com', path='', params='', query='', fragment='')
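Building on the parsed result, one possible way to filter external links is to compare netloc values. This is only a sketch: is_external and the sample links are hypothetical, and it deliberately treats www.domain.com and domain.com as different hosts, which may or may not be what you want.

```python
from urllib.parse import urlparse

def is_external(link, site_netloc):
    """Return True for http(s) links whose host differs from site_netloc."""
    parsed = urlparse(link)
    if parsed.scheme not in ("http", "https"):
        return False  # relative links, mailto:, etc. are not counted as external
    return parsed.netloc.lower() != site_netloc.lower()

links = [
    "https://domain.com/about",
    "http://other.org/page",
    "/relative/path",
    "https://Domain.com",
]
external = [link for link in links if is_external(link, "domain.com")]
print(external)  # ['http://other.org/page']
```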
To match a string that starts with one string but not with another, you should use this pattern:
^(?!stringyoudontwant)stringyouwant.*
So in your case, this would be:
^(?!https?:\/\/domain\.com)http.*
For this kind of thing, you can check out https://regex101.com, which is a handy interface for experimenting with complicated regexes.
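When the domain comes from a variable (like site in the question), it should be escaped before being spliced into the pattern, otherwise the dots match any character. A sketch under a few assumptions of mine that go beyond the pattern above: external_link_pattern is a hypothetical helper, an optional "www." prefix is treated as the same site, and the domain must be followed by a delimiter or the end of the string so that domain.com.evil.org is not mistaken for domain.com.

```python
import re

def external_link_pattern(domain):
    # Negative lookahead: reject http(s)://[www.]<domain> followed by a
    # delimiter or end of string; otherwise require an http(s):// prefix.
    return re.compile(
        r"^(?!https?://(?:www\.)?" + re.escape(domain) + r"(?:[/:?#]|$))https?://"
    )

pattern = external_link_pattern("domain.com")
links = [
    "http://domain.com/about",
    "https://www.domain.com",
    "https://domain.com.evil.org/x",
    "http://other.org",
    "/relative/path",
]
external = [link for link in links if pattern.match(link)]
print(external)  # ['https://domain.com.evil.org/x', 'http://other.org']
```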

How do I ensure that re.findall() stops at the right place?

Here is the code I have:
a='<title>aaa</title><title>aaa2</title><title>aaa3</title>'
import re
re.findall(r'<(title)>(.*)<(/title)>', a)
The result is:
[('title', 'aaa</title><title>aaa2</title><title>aaa3', '/title')]
If I ever designed a crawler to get me titles of web sites, I might end up with something like this rather than a title for the web site.
My question is, how do I limit findall to a single <title></title>?
Use re.search instead of re.findall if you only want one match:
>>> s = '<title>aaa</title><title>aaa2</title><title>aaa3</title>'
>>> import re
>>> re.search('<title>(.*?)</title>', s).group(1)
'aaa'
If you wanted all tags, then you should consider changing it to be non-greedy (ie - .*?):
print re.findall(r'<title>(.*?)</title>', s)
# ['aaa', 'aaa2', 'aaa3']
But really consider using BeautifulSoup or lxml or similar to parse HTML.
Use a non-greedy search instead:
r'<(title)>(.*?)<(/title)>'
The question-mark says to match as few characters as possible. Now your findall() will return each of the results you want.
http://docs.python.org/2/howto/regex.html#greedy-versus-non-greedy
re.findall(r'<(title)>(.*?)<(/title)>', a)
Add a ? after the *, so it will be non-greedy.
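To see the difference side by side, here is a small sketch running both forms on the string from the question:

```python
import re

a = '<title>aaa</title><title>aaa2</title><title>aaa3</title>'

# Greedy: .* runs to the end of the string, then backtracks to the LAST </title>
greedy = re.findall(r'<title>(.*)</title>', a)
# Non-greedy: .*? stops at the FIRST </title> it can reach
lazy = re.findall(r'<title>(.*?)</title>', a)

print(greedy)  # ['aaa</title><title>aaa2</title><title>aaa3']
print(lazy)    # ['aaa', 'aaa2', 'aaa3']
```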
It will be much easier using BeautifulSoup module.
https://pypi.python.org/pypi/beautifulsoup4
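If installing a third-party package is not an option, the standard library's html.parser can do the same job. Note that this swaps BeautifulSoup for a different tool; the TitleCollector class below is a hypothetical sketch, not part of any library.

```python
from html.parser import HTMLParser

class TitleCollector(HTMLParser):
    """Collect the text content of every <title> element."""

    def __init__(self):
        super().__init__()
        self.in_title = False
        self.titles = []

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self.in_title = True
            self.titles.append("")  # start a new, empty title

    def handle_endtag(self, tag):
        if tag == "title":
            self.in_title = False

    def handle_data(self, data):
        # Text may arrive in several chunks, so append to the current title
        if self.in_title:
            self.titles[-1] += data

collector = TitleCollector()
collector.feed('<title>aaa</title><title>aaa2</title><title>aaa3</title>')
collector.close()
print(collector.titles)
```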

python regex urls

I have a bunch of (ugly if I may say) urls, which I would like to clean up using python regex. So, my urls look something like:
http://www.thisislink1.com/this/is/sublink1/1
http://www.thisislink2.co.uk/this/is/sublink1s/klinks
http://www.thisislinkd.co/this/is/sublink1/hotlinks/2
http://www.thisislinkf.com.uk/this/is/sublink1d/morelink
http://www.thisislink1.co.in/this/is/sublink1c/mylink
....
What I'd like to do is clean up these urls, so that the final link looks like:
http://www.thisislink1.com
http://www.thisislink2.co.uk
http://www.thisislinkd.co
http://www.thisislinkf.de
http://www.thisislink1.us
....
and I was wondering how I can achieve this in a pythonic way. Sorry if this is a 101 question - I am new to Python regex structures.
Use urlparse.urlsplit:
In [3]: import urlparse
In [8]: url = urlparse.urlsplit('http://www.thisislink1.com/this/is/sublink1/1')
In [9]: url.netloc
Out[9]: 'www.thisislink1.com'
In Python3 it would be
import urllib.parse as parse
url = parse.urlsplit('http://www.thisislink1.com/this/is/sublink1/1')
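To get the cleaned-up form the question asks for (scheme plus host, nothing else), the two parts can be joined back together. A Python 3 sketch over some of the URLs from the question:

```python
from urllib.parse import urlsplit

urls = [
    "http://www.thisislink1.com/this/is/sublink1/1",
    "http://www.thisislink2.co.uk/this/is/sublink1s/klinks",
    "http://www.thisislinkd.co/this/is/sublink1/hotlinks/2",
]

# Keep only scheme and netloc; path, query and fragment are dropped
cleaned = [f"{parts.scheme}://{parts.netloc}" for parts in map(urlsplit, urls)]
for link in cleaned:
    print(link)
```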
Why use regex?
>>> import urlparse
>>> url = 'http://www.thisislinkd.co/this/is/sublink1/hotlinks/2'
>>> urlparse.urlsplit(url)
SplitResult(scheme='http', netloc='www.thisislinkd.co', path='/this/is/sublink1/hotlinks/2', query='', fragment='')
You should use a URL parser like others have suggested but for completeness here is a solution with regex:
import re
url='http://www.thisislink1.com/this/is/sublink1/1'
re.sub('(?<![/:])/.*','',url)
>>> 'http://www.thisislink1.com'
Explanation:
Match everything after and including the first forwardslash that is not preceded by a : or / and replace it with nothing ''.
(?<![/:]) # Negative lookbehind for '/' or ':'
/.* # Match a / followed by anything
Maybe use something like this:
result = re.sub(r"(?m)(http://(www)?\..*?)/.*", r"\1", subject)

Delete Chars in Python

Does anybody know how to delete all characters after a specific character?
like this:
http://google.com/translate_t
into
http://google.com
If you're asking about an arbitrary string and not a URL, you could go with:
>>> astring ="http://google.com/translate_t"
>>> astring.rpartition('/')[0]
'http://google.com'
For urls, using urlparse:
>>> import urlparse
>>> parts = urlparse.urlsplit('http://google.com/path/to/resource?query=spam#anchor')
>>> parts
('http', 'google.com', '/path/to/resource', 'query=spam', 'anchor')
>>> urlparse.urlunsplit((parts[0], parts[1], '', '', ''))
'http://google.com'
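The session above is Python 2. In Python 3 the same functions are in urllib.parse, and the result is a named tuple, so the parts can be read by name. A minimal sketch:

```python
from urllib.parse import urlsplit, urlunsplit

parts = urlsplit('http://google.com/path/to/resource?query=spam#anchor')

# Rebuild the URL from scheme and netloc only, leaving the rest empty
base = urlunsplit((parts.scheme, parts.netloc, '', '', ''))
print(base)  # http://google.com
```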
For arbitrary strings, using re:
>>> import re
>>> re.split(r'\b/\b', 'http://google.com/path/to/resource', 1)
['http://google.com', 'path/to/resource']
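url = "http://google.com/translate_t"
shortened = url[0:url.rfind("/")]
Should do it. url[a:b] returns a substring in Python, and rfind returns the index of the last occurrence of a character sequence, searching from the end of the string. (The variable is named url rather than str so that it does not shadow the built-in str type.)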
If you know the position of the character then you can use the slice syntax to create a new string:
In [2]: s1 = "abc123"
In [3]: s2 = s1[:3]
In [4]: print s2
abc
To find the position you can use the find() or index() methods of strings.
The split() and partition() methods may be useful, too.
Those methods are documented in the Python docs for sequences.
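A sketch of how these methods combine, using the "abc123" string from above. truncate_at is a hypothetical helper name. One detail worth knowing: find() returns -1 when the character is missing, so slicing with its result directly would silently drop the last character, whereas partition() simply returns the whole string.

```python
def truncate_at(text, sep):
    # partition() returns (head, sep, tail); when sep is absent,
    # head is the whole string and the other two parts are empty
    head, _, _ = text.partition(sep)
    return head

s1 = "abc123"
print(truncate_at(s1, "1"))  # abc
print(truncate_at(s1, "x"))  # abc123

# The same with find() needs an explicit check for -1
pos = s1.find("1")
s2 = s1[:pos] if pos != -1 else s1
print(s2)                    # abc
```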
Removing part of a string in place is impossible because strings are immutable; each of these operations creates a new string instead.
If you want to process URLs then you should definitely use the urlparse library. It lets you split a URL into its parts. If you just want to remove a part of the file path then you will still have to do that yourself.
