Find right URLs using Python and regex - python

I have a table with urls like
vk.com/albums54751623?z=photo54751623_341094858%2Fphotos54751623
vk.com/albums54751623
vk.com/id36375649
vk.com/id36375649
I need to find all urls like vk.com/id36375649 (only id)
I tried
for url in urls:
    if url == re.compile('vk.com/^[a-z0-9]'):
        print(url)
    else:
        continue
but this is incorrect: it doesn't print anything.

You can use startswith:
strs = ['vk.com/albums54751623?z=photo54751623_341094858%2Fphotos54751623',
'vk.com/albums54751623',
'vk.com/id36375649',
'vk.com/id36375649']
print([x for x in strs if x.startswith(r'vk.com/id')])
UPDATE
To address the issues stated in comments below this answer, you will have to use a regex with some checks:
^vk\.com/(?!album)\w+$
A Python demo:
import re
strs = ['vk.com/albums54751623?z=photo54751623_341094858%2Fphotos54751623',
'vk.com/albums54751623',
'vk.com/id36375649',
'vk.com/id36375649',
'vk.com/id36375649?z=album-28413960_228518010',
'vk.com/tania_sevostianova'
]
print([x for x in strs if re.search(r'^vk\.com/(?!album)\w+$', x)])
# => ['vk.com/id36375649', 'vk.com/id36375649', 'vk.com/tania_sevostianova']

A regular expression like the following might work:
vk\.com/id\d+
Remember that in a regex you need to escape certain characters, such as the dot. The forward slash has no special meaning in Python's re module and needs no escaping.
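The loop in the question prints nothing because url == re.compile(...) compares a string with a compiled pattern object, and the two are never equal; the pattern has to be applied through one of its matching methods. A minimal sketch, assuming "only id" means the literal prefix id followed by digits and nothing else (the names ID_URL and id_urls are illustrative, not from the question):

```python
import re

urls = [
    'vk.com/albums54751623?z=photo54751623_341094858%2Fphotos54751623',
    'vk.com/albums54751623',
    'vk.com/id36375649',
    'vk.com/id36375649',
    'vk.com/id36375649?z=album-28413960_228518010',
]

# Assumption: an "id" URL is exactly vk.com/id<digits>, with nothing after it.
ID_URL = re.compile(r'vk\.com/id\d+')

# fullmatch() succeeds only if the whole string matches the pattern,
# so URLs with a trailing query string are rejected.
id_urls = [url for url in urls if ID_URL.fullmatch(url)]
print(id_urls)
```

If profile URLs without the id prefix should also be kept, the lookahead pattern from the answer above is the better fit.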

Related

how to use python re findall and regex so that 2 conditions can be run simultaneously?

Here is my data string:
MYDATA=DATANORMAL
MYDATA=DATA_NOTNORMAL
I use this code, but when I run it, it shows nothing for DATANORMAL:
mydata = re.findall(r'MYDATA=(.*)' r'_.*', mystring)
print(mydata)
It only shows NOTNORMAL.
I want both to work and to display the data like this:
DATANORMAL
NOTNORMAL
How do I do it? Thanks.
import re
mystring = """
MYDATA=DATANORMAL
MYDATA=DATA_NOTNORMAL
"""
mydata = re.findall(r'^\s*MYDATA=(?:.+_)?(.+?)\s*$', mystring, re.M)
print(mydata)
If you need the word before the _ rather than the one after it, use the regex r'^\s*MYDATA=(.+?)(?:_.+)?\s*$' in the code above.
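To make the difference between the two patterns concrete, here is a small side-by-side sketch on the sample data from the question. The variable names after_underscore and before_underscore are illustrative only.

```python
import re

mystring = """
MYDATA=DATANORMAL
MYDATA=DATA_NOTNORMAL
"""

# Variant 1: optionally skip everything up to the last "_" on the line,
# then capture what follows.
after_underscore = re.findall(r'^\s*MYDATA=(?:.+_)?(.+?)\s*$', mystring, re.M)

# Variant 2: capture lazily, then optionally skip "_" and the rest of the line.
before_underscore = re.findall(r'^\s*MYDATA=(.+?)(?:_.+)?\s*$', mystring, re.M)

print(after_underscore)   # ['DATANORMAL', 'NOTNORMAL']
print(before_underscore)  # ['DATANORMAL', 'DATA']
```

In both variants the optional part sits in a non-capturing group, so findall() still returns a flat list of strings.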
Based on what you describe, you might want to use an alternation here:
\bMYDATA=((?:DATA|(?:DATA_))\S+)\b
Script:
import re
inp = "some text MYDATA=DATANORMAL more text MYDATA=DATA_NOTNORMAL"
mydata = re.findall(r'\bMYDATA=((?:DATA|(?:DATA_))\S+)\b', inp)
print(mydata)
This prints:
['DATANORMAL', 'DATA_NOTNORMAL']
I guess you need to add flags=re.M?
import re
mystring = """
MYDATA=DATANORMAL
MYDATA=DATA_NOTNORMAL"""
pattern = re.compile(r"MYDATA=(?:DATA_)?(\w+)", flags=re.M)  # raw string, so \w is not read as a string escape
print(pattern.findall(mystring))

Deleting all occurrences of '/' after its 2nd occurrence in Python

I have a URL string, https://example.com/about/hello/
I want to split it into 'https://example.com', 'about', 'hello'
How can I do this?
Use urlparse to correctly parse a URL (the code below is Python 2; in Python 3 the module is urllib.parse):
import urlparse
url = 'https://example.com/about/hello/'
parts = urlparse.urlparse(url)
paths = [p for p in parts.path.split('/') if p]
print 'Scheme:', parts.scheme # https
print 'Host:', parts.netloc # example.com
print 'Path:', parts.path # /about/hello/
print 'Paths:', paths # ['about', 'hello']
In the end, the information you want is in the parts.scheme, parts.netloc and paths variables.
You may do this:
First split by '/'.
Then join by '/' only up to the 3rd occurrence.
Code:
text="https://example.com/about/hello/"
groups = text.split('/')
print( "/".join(groups[:3]),groups[3],groups[4])
Output:
https://example.com about hello
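Inspired by Hai Vu's answer. This solution is for Python 3:
from urllib.parse import urlparse
url = 'https://example.com/about/hello/'
parts = [p for p in urlparse(url).path.split('/') if p]
# Re-join with '/' so the scheme separator '://' is preserved.
parts.insert(0, '/'.join(url.split('/')[:3]))
# parts is now ['https://example.com', 'about', 'hello']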
There are lots of ways to do this. You could use re.split() to split on a regular expression, for instance.
>>> import re
>>> re.split(r'\b/\b', 'https://example.com/about/hello/'.rstrip('/'))
['https://example.com', 'about', 'hello']
The trailing slash is stripped first because \b/\b does not match it; without rstrip('/') the last item would be 'hello/'.
re is part of the standard library; re.split is documented at https://docs.python.org/3/library/re.html#re.split
The regex itself uses \b, which means a boundary between a "word" character and a "non-word" character. You can use regex101 to explore how it works: https://regex101.com/r/mY8fV8/1
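The approaches above can be folded into one small helper. This is only a sketch: split_url is a hypothetical name, and it assumes the desired result is the scheme and host as one item followed by the non-empty path segments.

```python
from urllib.parse import urlsplit


def split_url(url):
    """Return ['scheme://host', segment1, segment2, ...] for a URL."""
    parts = urlsplit(url)
    base = f'{parts.scheme}://{parts.netloc}'
    # Empty strings come from leading, trailing or doubled slashes; drop them.
    segments = [segment for segment in parts.path.split('/') if segment]
    return [base] + segments


print(split_url('https://example.com/about/hello/'))
```

Letting urlsplit find the path avoids counting slashes by hand, and any query string or fragment stays out of the path segments.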

Complex regex in Python

I am trying to write a generic pattern using regex so that it fetches only particular things from the string. Let's say we have strings like GigabitEthernet0/0/0/0 or FastEthernet0/4 or Ethernet0/0.222. The regex should fetch the first 2 characters and all the numerals. Therefore, the fetched result should be something like Gi0000 or Fa04 or Et00222 depending on the above cases.
x = 'GigabitEthernet0/0/0/2'
m = re.search('([\w+]{2}?)[\\\.(\d+)]{0,}',x)
I do not understand how to write the regular expression. The values can also be fetched as a list. I wrote a few more patterns, but they didn't help.
With regex, you can use the re.findall function.
>>> import re
>>> s = 'GigabitEthernet0/0/0/0 '
>>> s[:2]+''.join(re.findall(r'\d', s))
'Gi0000'
OR
>>> ''.join(re.findall(r'^..|\d', s))
'Gi0000'
>>> ''.join(re.findall(r'^..|\d', 'Ethernet0/0.222'))
'Et00222'
OR
>>> s = 'GigabitEthernet0/0/0/0 '
>>> s[:2]+''.join([i for i in s if i.isdigit()])
'Gi0000'
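Wrapped in a function, the ^..|\d pattern can be checked against all three interface names from the question. The function name abbreviate is illustrative only, and the sketch assumes the first two characters are always wanted as they are.

```python
import re


def abbreviate(interface):
    """Keep the first two characters plus every digit that follows them."""
    # '^..' matches the first two characters; '\d' matches each later digit.
    return ''.join(re.findall(r'^..|\d', interface))


for name in ('GigabitEthernet0/0/0/0', 'FastEthernet0/4', 'Ethernet0/0.222'):
    print(name, '->', abbreviate(name))
```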
z = "Ethernet0/0.222."
print(z[:2] + "".join(re.findall(r"(\d+)(?=[\d\W]*$)", z)))
You can try this. It makes sure that only the digits at the end of the string come into play.
Here is another option:
s = 'Ethernet0/0.222'
"".join(re.findall(r'^\w{2}|\d+', s))

Using regular expression to extract string

I need to extract the IP address from the following string.
>>> mydns='ec2-54-196-170-182.compute-1.amazonaws.com'
The text to the left of the dot needs to be returned. The following works as expected.
>>> mydns[:18]
'ec2-54-196-170-182'
But it does not work in all cases. For e.g.
mydns='ec2-666-777-888-999.compute-1.amazonaws.com'
>>> mydns[:18]
'ec2-666-777-888-99'
How do I use regular expressions in Python?
No need for regex... Just use str.split
mydns.split('.', 1)[0]
Demo:
>>> mydns='ec2-666-777-888-999.compute-1.amazonaws.com'
>>> mydns.split('.', 1)[0]
'ec2-666-777-888-999'
If you want to use regex for this:
Regex string:
ec2-([0-9]{1,3})-([0-9]{1,3})-([0-9]{1,3})-([0-9]{1,3}).*
Alternative (EC2-agnostic):
.*\b([0-9]{1,3})-([0-9]{1,3})-([0-9]{1,3})-([0-9]{1,3}).*
Replacement string:
Regular: \1.\2.\3.\4
Reverse: \4.\3.\2.\1
Python code:
import re
subject = 'ec2-54-196-170-182.compute-1.amazonaws.com'
result = re.sub(r"ec2-([0-9]{1,3})-([0-9]{1,3})-([0-9]{1,3})-([0-9]{1,3}).*", r"\1.\2.\3.\4", subject)
print(result)
This regex will match it: ^[^.]+
So try this:
import re
string = "ec2-54-196-170-182.compute-1.amazonaws.com"
ip = re.findall(r'^[^.]+', string)[0]
print(ip)
Output:
ec2-54-196-170-182
The best thing is that this will match whether the instance prefix is ec2, ec3 and so on, so this regex is very similar to @mgilson's code.
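If the goal is the dotted IP address rather than the host label, the capture-group idea above can be combined with a range check. A minimal sketch, assuming hostnames always start with ec2- followed by four dash-separated numbers; extract_ip and EC2_HOST are hypothetical names, and the validation uses the standard library's ipaddress module.

```python
import ipaddress
import re

# Assumption: the hostname starts with "ec2-" and four dash-separated numbers.
EC2_HOST = re.compile(r'ec2-(\d{1,3})-(\d{1,3})-(\d{1,3})-(\d{1,3})\.')


def extract_ip(hostname):
    """Return the dotted IPv4 address embedded in hostname, or None."""
    match = EC2_HOST.match(hostname)
    if match is None:
        return None
    candidate = '.'.join(match.groups())
    try:
        # ip_address() raises ValueError for out-of-range octets such as 666.
        ipaddress.ip_address(candidate)
    except ValueError:
        return None
    return candidate


print(extract_ip('ec2-54-196-170-182.compute-1.amazonaws.com'))
```

The second sample from the question (ec2-666-777-888-999) matches the pattern but is not a valid address, so this sketch returns None for it.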

What am I doing wrong with this regular expression

links = re.findall('href="(http(s?)://[^"]+)"',page)
I have this regular expression to find all links in a website, but I am getting this result:
('http://asecuritysite.com', '')
('https://www.sans.org/webcasts/archive/2013', 's')
When what I want is only this:
http://asecuritysite.com
https://www.sans.org/webcasts/archive/2013
If I eliminate the "( after the href it gives me loads of errors, can someone explain why?
If you use more than one capturing group, re.findall returns a list of tuples instead of a list of strings. Try the following (using only a single group):
>>> import re
>>> page = '''
... <a href="http://asecuritysite.com">here</a>
... <a href="https://www.sans.org/webcasts/archive/2013">there</a>
... '''
>>> re.findall(r'href="(https?:\/\/[^"]+)"',page)
['http://asecuritysite.com', 'https://www.sans.org/webcasts/archive/2013']
According to re.findall documentation:
If one or more groups are present in the pattern, return a list of
groups; this will be a list of tuples if the pattern has more than one
group.
Try getting rid of the second group (the (s?) in your original pattern):
links = re.findall(r'href="(https?://[^"]+)"', page)
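If the inner grouping is still wanted for readability, it can be made non-capturing with (?:...) so that only one capturing group remains. A short sketch comparing the two; the page string is a made-up sample shaped like the one in the question.

```python
import re

# Hypothetical sample markup, shaped like the page in the question.
page = (
    '<a href="http://asecuritysite.com">here</a>\n'
    '<a href="https://www.sans.org/webcasts/archive/2013">there</a>\n'
)

# Two capturing groups: findall() returns one tuple per match.
two_groups = re.findall(r'href="(http(s?)://[^"]+)"', page)

# (?:s?) groups without capturing, so findall() returns plain strings.
one_group = re.findall(r'href="(http(?:s?)://[^"]+)"', page)

print(two_groups)
print(one_group)
```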
What you are doing wrong is trying to parse HTML with regex. And that, sir, is a sin.
An alternative is to use something like lxml to parse the page and extract the links, something like this:
urls = html.xpath('//a/@href')
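The same parser-based idea can be sketched without a third-party dependency. Note the swap: this uses the standard library's html.parser instead of lxml, and LinkCollector is a hypothetical class name. It assumes only absolute http and https links on a tags are wanted.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect absolute http(s) href values from <a> tags."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != 'a':
            return
        # attrs is a list of (name, value) pairs; value may be None.
        for name, value in attrs:
            if name == 'href' and value and value.startswith(('http://', 'https://')):
                self.links.append(value)


# Hypothetical sample markup mixing single and double quotes.
page = """
<a href='http://asecuritysite.com'>here</a>
<a href="https://www.sans.org/webcasts/archive/2013">there</a>
<a href="/relative/path">skipped</a>
"""

collector = LinkCollector()
collector.feed(page)
print(collector.links)
```

Because the parser handles attribute quoting itself, single-quoted and double-quoted attributes are treated the same way.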
You're also going to run into problems if there is a single quote before the https? instead of a double quote.
(https?:\/\/[^\"\'\>]+) will capture the entire string; what you could then do is prepend (href=.?) to it, and you'd end up with two capture groups:
Full regex: (href=.?)(https?:\/\/[^\"\'\>]+)
MATCH 1
[Group 1] href='
[Group 2] http://asecuritysite.com
MATCH 2
[Group 1] href='
[Group 2] https://www.sans.org/webcasts/archive/2013
A working example: http://regex101.com/r/gO8vV7
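In Python, the quote problem can also be handled with a character class while keeping a single capturing group, so findall() still returns plain strings. A minimal sketch on made-up sample markup:

```python
import re

# Hypothetical sample markup mixing single and double quotes.
page = """
<a href='http://asecuritysite.com'>here</a>
<a href="https://www.sans.org/webcasts/archive/2013">there</a>
"""

# ["'] accepts either quote character after href=; the URL itself stops at
# the next quote or '>'.
links = re.findall(r'''href=["'](https?://[^"'>]+)''', page)
print(links)
```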
