I am using Python to extract the filename from a link using rfind like below:
url = "http://www.google.com/test.php"
print url[url.rfind("/") +1 : ]
This works fine with links that have no / at the end and returns "test.php". However, I have encountered links with a / at the end, like "http://www.google.com/test.php/". I am having trouble getting the page name when there is a "/" at the end; can anyone help?
Cheers
Just removing the slash at the end won't work, as you can probably have a URL that looks like this:
http://www.google.com/test.php?filepath=tests/hey.xml
...in which case you'll get back "hey.xml". Instead of manually checking for this, you can use urlparse to get rid of the parameters, then do the check other people suggested:
from urlparse import urlparse
url = "http://www.google.com/test.php?something=heyharr/sir/a.txt"
f = urlparse(url)[2].rstrip("/")
print f[f.rfind("/")+1:]
Use [r]strip to remove trailing slashes:
url.rstrip('/').rsplit('/', 1)[-1]
If a wider range of possible URLs is possible, including URLs with ?queries, #anchors or without a path, do it properly with urlparse:
path = urlparse.urlparse(url).path
return path.rstrip('/').rsplit('/', 1)[-1] or '(root path)'
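Put together, a minimal self-contained sketch of that idea in Python 3 (where urlparse lives in urllib.parse; the function name is just illustrative):
from urllib.parse import urlparse

def last_path_component(url):
    # Parse the URL so queries (?...) and fragments (#...) are ignored,
    # then drop any trailing slash and keep the final path segment.
    path = urlparse(url).path
    return path.rstrip('/').rsplit('/', 1)[-1] or '(root path)'

print(last_path_component("http://www.google.com/test.php/"))                        # test.php
print(last_path_component("http://www.google.com/test.php?filepath=tests/hey.xml"))  # test.php
print(last_path_component("http://www.google.com/"))                                 # (root path)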
Filenames with a slash at the end are technically still path definitions and indicate that the index file is to be read. If you actually have one that ends in test.php/, I would consider that an error. In any case, you can strip the / from the end before running your code as follows:
url = url.rstrip('/')
There is a library called urlparse that will parse the URL for you, but it still doesn't remove the / at the end, so one of the approaches above will be the best option.
Just for fun, you can use a Regexp:
import re
print re.search('/([^/]+)/?$', url).group(1)
You could use
stripped = url.rstrip("/")
print stripped[stripped.rfind("/") + 1:]
filter(None, url.split('/'))[-1]
(But urlparse is probably more readable, even if more verbose.)
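(Note that in Python 3, filter returns an iterator, so a list comprehension is an easy equivalent; a quick sketch:)
parts = [p for p in url.split('/') if p]   # drop the empty strings produced by "//" and a trailing "/"
print(parts[-1])                           # 'test.php' for "http://www.google.com/test.php/"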
Question
Assume that I have a string like this:
example_text = 'b\'\\x08\\x13"\\\\https://www.example.com/link_1.html\\xd2\\x01`https://www.example.com/link_2.html\''
Expectation
And I want to only extract the first url, which is
output = "https://www.example.com/link_1.html"
I think using a regex to find the URL that starts with "https" and ends at '\' would be a good solution.
If so, how can I write the regex pattern?
I tried something like this:
re.findall("https://([^\\\\)]+)", example_text)
output = ['www.example.com/link_1.html', 'www.example.com/link_2.html']
But then I need to add "https://" back and choose the first item in the result.
Is there any other solution?
You need to tweak your regex a bit.
What you were doing before:
https://([^\\\\)]+) this matches your link but only captures the part after https://, because the capturing group starts after it.
Updated Regex:
(https\:\/\/[^\\\\)]+) this matches the link and also captures the whole thing (the special characters are escaped to avoid errors).
In Code:
import re
example_text = 'b\'\\x08\\x13"\\\\https://www.example.com/link_1.html\\xd2\\x01`https://www.example.com/link_2.html\''
print(re.findall("(https\:\/\/[^\\\\)]+)", example_text))
Output:
['https://www.example.com/link_1.html', "https://www.example.com/link_2.html'"]
You could also use (https\:\/\/([^\\\\)]+).html) to get each link both with and without the https:// prefix as a tuple (this also avoids the trailing ' that you might get in some links).
If you want only the first one, simply do output[0].
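A quick sketch of that tuple-returning variant, with the pattern written as a raw string (the behavior should be the same):
import re

example_text = 'b\'\\x08\\x13"\\\\https://www.example.com/link_1.html\\xd2\\x01`https://www.example.com/link_2.html\''
# Each match is a tuple: (the whole link, the part between https:// and .html).
matches = re.findall(r"(https://([^\\)]+)\.html)", example_text)
print(matches[0][0])   # https://www.example.com/link_1.html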
Try:
match = re.search(r"https://[^\\']+", example_text)
url = match.group()
print(url)
output:
https://www.example.com/link_1.html
https://epolicy.companyname.co.in/PRODUCTNAME/UI/PremiumCalculation.aspx?utm_source=rtb&utm_medium=display&utm_campaign=dbmew-Category-pros&dclid=CO2g3u7Gy98CFUOgaAodUv4E0w
I have millions of such URLs and I want to extract two things from this.
PRODUCTNAME: always preceded by https://epolicy.companyname.co.in
*.aspx: Page accessed
I tried the following regular expression
re.findall('([a-zA-Z]+\.aspx | https://epolicy\.companyname\.co\.in/(.*?)/UI)', URL)
and a few variants of it, but it didn't work. What is the correct way to do this?
Try this!
Code:
import re
url = "https://epolicy.companyname.co.in/PRODUCTNAME/UI/PremiumCalculation.aspx?utm_source=rtb&utm_medium=display&utm_campaign=dbmew-Category-pros&dclid=CO2g3u7Gy98CFUOgaAodUv4E0w"
print(re.findall('https://[^/]*/(.*)/UI/(.*).aspx', url))
Output:
[('PRODUCTNAME', 'PremiumCalculation')]
Regex doesn't seem to be the right thing to use here at all. Rather, parse the URL, split the path, and get the first and last elements.
from urllib.parse import urlparse
from pathlib import PurePath
components = urlparse(url)
path = PurePath(components.path)
product_name = path.parts[1]
page = path.stem
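Applied to the URL from the question, that should give the following (a Python 3 sketch, with names chosen just for illustration):
from urllib.parse import urlparse
from pathlib import PurePath

url = "https://epolicy.companyname.co.in/PRODUCTNAME/UI/PremiumCalculation.aspx?utm_source=rtb&utm_medium=display&utm_campaign=dbmew-Category-pros&dclid=CO2g3u7Gy98CFUOgaAodUv4E0w"
path = PurePath(urlparse(url).path)   # the query string is dropped by urlparse
print(path.parts[1])                  # PRODUCTNAME
print(path.stem)                      # PremiumCalculation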
What is the preferred way to cut off random characters at the end of a string in Python?
I am trying to simplify a list of URLs for some analysis and therefore need to cut off everything that comes after the file extension .php.
Since the characters that follow .php are different for each URL, using strip() doesn't work. I thought about regex and substring(), but what would be the most efficient way to solve this task?
Example:
Let's say I have the following URLs:
example.com/index.php?random_var=random-19wdwka
example.org/index.php?another_var=random-2js9m2msl
And I want the output to be:
example.com/index.php
example.org/index.php
Thanks for your advice!
There are two ways to accomplish what you want.
If you know how the string ends:
In your example, if you know that the part you want always ends with .php and is followed by a ?, then all you need to do is:
my_string.split('?')[0]
If you don't know how the string ends:
In this case you can use urlparse and take everything but the parameters.
from urlparse import urlparse
for url in urls:
    p = urlparse(url)
    print p.scheme + p.netloc + p.path
for url in urls:
    result = url.split('?')[0]
    print(result)
Split on your separator at most once, and take the first piece:
text="example.com/index.php?random_var=random-19wdwka"
sep="php"
rest = text.split(sep)[0]+".php"
print rest
It seems like what you really want is to strip away the parameters of the URL; you can also use
from urlparse import urlparse, urlunparse
urlunparse(urlparse(url)[:3] + ('', '', ''))
to replace the params, query and fragment parts of the URL with empty strings and generate a new one.
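A complete sketch of that approach applied to the question's example URLs (Python 3, where these functions live in urllib.parse):
from urllib.parse import urlparse, urlunparse

urls = [
    "example.com/index.php?random_var=random-19wdwka",
    "example.org/index.php?another_var=random-2js9m2msl",
]
for url in urls:
    # Keep scheme, netloc and path; blank out params, query and fragment.
    print(urlunparse(urlparse(url)[:3] + ('', '', '')))
# example.com/index.php
# example.org/index.php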
I have a crawler set up with Scrapy and am trying to process links. The problem is that the links are embedded in JavaScript and I am struggling to create a regular expression. Here are 3 samples of what I am trying to process:
javascript:openInIFrame('main', 'setup.phtml%3f.op%3d3800%26.who%3dAAAAAAAAAAAA%26.menuItemRefNo=118')
javascript:window.open('overview.phtml?&.who=AAAAAAAAAAAA&.id=2', '43425235', 'menubar=no,toolbar=no,location=no,resizable=yes,maximize=yes');
javascript:openInIFrame('main', "page.phtml%3f.op%3d1499%26.who%3dAAAAAAAAAAAA%26.ifmod%3dtest&.menuItemRefNo=7")
The resulting relative URL for each would be between the single/double quotes:
setup.phtml%3f.op%3d3800%26.who%3dAAAAAAAAAAAA%26.menuItemRefNo=118
overview.phtml?&.who=AAAAAAAAAAAA&.id=2
page.phtml%3f.op%3d1499%26.who%3dAAAAAAAAAAAA%26.ifmod%3dtest&.menuItemRefNo=7
I have tried variations of '(.*?)' and (["'])(?:(?=(\\?))\2.)*?\1 but cannot seem to get it right. What am I missing here?
Maybe try something like this:
['"].*phtml.*['"]
http://regex101.com/r/lX6xX8/1
Try this
import re
url_regex = re.compile(r"(?:javascript:openInIFrame\('main',|javascript:window.open\()\s*(?:'|\")([^'\"]+)(?:'|\")")
samples = [
    "javascript:openInIFrame('main', 'setup.phtml%3f.op%3d3800%26.who%3dAAAAAAAAAAAA%26.menuItemRefNo=118')",
    "javascript:window.open('overview.phtml?&.who=AAAAAAAAAAAA&.id=2', '43425235', 'menubar=no,toolbar=no,location=no,resizable=yes,maximize=yes');",
    "javascript:openInIFrame('main', \"page.phtml%3f.op%3d1499%26.who%3dAAAAAAAAAAAA%26.ifmod%3dtest&.menuItemRefNo=7\")"
]

for sample in samples:
    md = url_regex.search(sample)
    if md:
        print md.group(1)
    else:
        print 'NO MATCH'
For me, this outputs:
setup.phtml%3f.op%3d3800%26.who%3dAAAAAAAAAAAA%26.menuItemRefNo=118
overview.phtml?&.who=AAAAAAAAAAAA&.id=2
page.phtml%3f.op%3d1499%26.who%3dAAAAAAAAAAAA%26.ifmod%3dtest&.menuItemRefNo=7
The trick is the ([^'\"]+). This captures any sequence of one or more characters, as long as the character is not a double or single quote. So basically, it matches everything up to the end of the URL string, which is precisely the URL. Note that the \" is only necessary because the regex itself is delimited with double quotes.
I am trying to get the address of a website's Facebook page using a regular expression search on the HTML.
Usually the link appears as an anchor whose visible text is just "Facebook", but sometimes the address will be http://www.facebook.com/some.other, and sometimes it ends with numbers.
At the moment the regex that I have is
'(facebook.com)\S\w+'
but it won't catch the last two possibilities.
What is it called when I want the regex to search for a part but not fetch it? (For instance, I want the regex to match the www.facebook.com part but not have that part in the result, only the part that comes after it.)
Note: I use Python with re and urllib2.
It seems to me your main issue is that you don't understand regex well enough.
fb_re = re.compile(r'www.facebook.com([^"]+)')
then simply:
results = fb_re.findall(url)
Why this works:
In regular expressions, the part in the parentheses () is what is captured; you were putting the www.facebook.com part in the parentheses, so it was not capturing anything else.
Here I used a character set [] to match anything inside it, used the ^ operator to negate that (meaning anything not in the set), and then gave it the " character, so it will match anything that comes after www.facebook.com until it reaches a " and then stop.
Note: this catches Facebook links which are embedded. If the Facebook link is simply on the page in plain text, you can use:
fb_re = re.compile(r'www.facebook.com(\S+)')
which grabs any non-whitespace characters, so it stops as soon as it hits whitespace.
If you are worried about links ending in periods, you can simply use:
fb_re = re.compile(r'www.facebook.com(\S+)\.\s')
which tells it to search for the same as above, but stop when it gets to the end of a sentence: a . followed by any whitespace like a space or a newline. This way it will still grab links like /some.other, but when you have something like /some.other. it will drop the final period.
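A quick sanity check of both patterns on made-up snippets (the HTML and text here are just illustrative):
import re

# Made-up snippets just to exercise both patterns; real page HTML will differ.
embedded_html = '<a href="http://www.facebook.com/some.other">Like us</a>'
plain_text = 'Find us at http://www.facebook.com/pages/Foo/12345 today'

fb_embedded = re.compile(r'www.facebook.com([^"]+)')
print(fb_embedded.findall(embedded_html))   # ['/some.other']

fb_plain = re.compile(r'www.facebook.com(\S+)')
print(fb_plain.findall(plain_text))         # ['/pages/Foo/12345']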
If I assume correctly, the URL is always in double quotes, right?
re.findall(r'"http://www.facebook.com(.+?)"',url)
Overall, trying to parse HTML with regex is a bad idea. I suggest you use an HTML parser like lxml.html to find the links and then use urlparse.
>>> from urlparse import urlparse # in 3.x use from urllib.parse import urlparse
>>> url = 'http://www.facebook.com/some.other'
>>> parse_object = urlparse(url)
>>> parse_object.netloc
'www.facebook.com'
>>> parse_object.path
'/some.other'
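For completeness, a rough sketch of that combination, assuming lxml is installed and that html stands in for the fetched page source (Python 3):
import lxml.html
from urllib.parse import urlparse

# html stands in for the real page source here.
html = '<html><body><a href="http://www.facebook.com/some.other">Facebook</a></body></html>'
doc = lxml.html.fromstring(html)

for href in doc.xpath('//a/@href'):
    parts = urlparse(href)
    # Keep only links pointing at facebook.com and print the part after the host.
    if parts.netloc.endswith('facebook.com'):
        print(parts.path)   # /some.other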