Python extract username from URL - python

I'm scraping Reddit usernames using Python and I'm trying to extract the username from a URL. The URL looks like this:
https://www.reddit.com/user/ExampleUser
This is my code:
def extract_username(url):
    start = url.find('https://www.reddit.com/user/') + 28
    end = url.find('?', start)
    end2 = url.find("/", start)
    return url[start:end] and url[start:end2] and url[start:]
The first part works, but removing the question mark and forward slash doesn't. Maybe I'm using the "and" keyword wrong? As a result, I sometimes get something like this:
ExampleUser/
ExampleUser/comments/
ExampleUser/submitted/
ExampleUser/gilded/
ExampleUser?sort=hot
ExampleUser?sort=new
ExampleUser?sort=top
ExampleUser?sort=controversial
I know I can use the API, but I'd like to learn how to do it without it. I've also heard about regular expressions, but aren't they pretty slow?

You could use the re module.
>>> s = "https://www.reddit.com/user/ExampleUser/comments/"
>>> import re
>>> re.search(r'https://www.reddit.com/user/([^/?]+)', s).group(1)
'ExampleUser'
[^/?]+ is a negated character class that matches one or more characters other than / or ?. The parentheses () around it form a capturing group, which captures the matched characters. The captured text can be retrieved later by its group index (here with .group(1), or with \1 in a back-reference).
You can also wrap this in a separate function:
>>> def extract_username(url):
...     return re.search(r'https://www.reddit.com/user/([^/?]+)', url).group(1)
...
>>> extract_username('https://www.reddit.com/user/ExampleUser')
'ExampleUser'
>>> extract_username('https://www.reddit.com/user/ExampleUser/submitted/')
'ExampleUser'
>>> extract_username('https://www.reddit.com/user/ExampleUser?sort=controversial')
'ExampleUser'
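Note that re.search returns None when nothing matches, so calling .group(1) directly raises an AttributeError on a URL that is not a user URL. A minimal sketch of a more defensive variant follows; the USER_URL name and the choice to return None are assumptions for illustration, not part of the answer above:

```python
import re

# Precompiled pattern; the dots are escaped so they only match a literal '.'.
# USER_URL is a hypothetical name chosen for this sketch.
USER_URL = re.compile(r'https://www\.reddit\.com/user/([^/?]+)')

def extract_username(url):
    """Return the username, or None if the URL is not a user URL."""
    match = USER_URL.search(url)
    return match.group(1) if match else None

print(extract_username('https://www.reddit.com/user/ExampleUser/comments/'))  # ExampleUser
print(extract_username('https://www.reddit.com/r/python'))                    # None
```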

This removes anything which follows a '?' and then splits on '/', retrieving the fifth element which is the user name:
>>> s = 'https://www.reddit.com/user/ExampleUser?sort=new'
>>> s.split('?')[0].split('/')[4]
'ExampleUser'
This also works on the other cases that you showed. For example:
>>> s = 'https://www.reddit.com/user/ExampleUser/comments/'
>>> s.split('?')[0].split('/')[4]
'ExampleUser'
>>> s = 'https://www.reddit.com/user/ExampleUser'
>>> s.split('?')[0].split('/')[4]
'ExampleUser'

Just for kicks, here's an example using find. Basically, you take the smallest index at which a delimiter is found, or the end of the string if none is found at all:
def extract_username(url):
    username = url[len('https://www.reddit.com/user/'):]
    # str.find returns -1 when the delimiter is absent, so keep only valid indices
    end = min(i for i in (len(username),
                          username.find('/'),
                          username.find('?')) if i >= 0)
    return username[:end]
for url in ('https://www.reddit.com/user/ExampleUser',
            'https://www.reddit.com/user/ExampleUser/submitted/',
            'https://www.reddit.com/user/ExampleUser?sort=controversial'):
    print(extract_username(url))
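As for the "and" question: "and" does not combine the slices. It returns the first falsy operand, or the last operand if all are truthy, so the code in the question always ends up returning url[start:] whenever the earlier slices are non-empty. A short sketch of both effects, using one of the URLs from the question:

```python
# `and` yields the first falsy operand, otherwise the last operand.
print('ExampleUser/comments' and 'ExampleUser' and 'ExampleUser/comments/')
# ExampleUser/comments/

url = 'https://www.reddit.com/user/ExampleUser/comments/'
start = len('https://www.reddit.com/user/')

# str.find returns -1 when the delimiter is absent, so the slice
# url[start:-1] silently drops the last character instead of failing.
end = url.find('?', start)
print(end)              # -1
print(url[start:end])   # ExampleUser/comments
```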

Related

Capture substring within a string - dynamically

I have a string:
ostring = "Ref('r1_featuring', ObjectId('5f475')"
What I am trying to do is search the string and check if it starts with Ref, if it does it should remove everything in the string and keep the substring 5f475.
I know this can be done using a simple replace like so:
string = ostring.replace("Ref('r1_featuring', ObjectId('", '').replace("')", '')
But I cannot do it this way as it needs to all be dynamic as there are going to be different strings each time. So I need to do it in a way that it will search the string and check if it starts with Ref, if it does then grab the alphanumeric value.
Desired Output:
5f475
Any help will be appreciated.
Like that?
>>> import re
>>> pattern = r"Ref.*'(.*)'\)$"
>>> m = re.match(pattern, "Ref('r1_featuring', ObjectId('5f475')")
>>> if m:
...     print(m.group(1))
...
5f475
# Python >= 3.8
>>> if m := re.match(pattern, "Ref('r1_featuring', ObjectId('5f475')"):
...     print(m.group(1))
...
5f475
A regex-free solution :)
ostring = "Ref('r1_featuring', ObjectId('5f475')"
if ostring.startswith("Ref"):
    desired_part = ostring.rpartition("('")[-1].rpartition("')")[0]
See str.rpartition.
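Either approach can be wrapped in a small function so it works on different input strings. This is only a sketch: the extract_object_id name is made up for illustration, and the pattern assumes the value inside ObjectId('...') is purely alphanumeric, as in the example.

```python
import re

# Assumption: the id inside ObjectId('...') consists of letters and digits only.
OBJECT_ID = re.compile(r"ObjectId\('([0-9A-Za-z]+)'\)")

def extract_object_id(text):
    """Return the ObjectId value if text starts with 'Ref', else None."""
    if not text.startswith("Ref"):
        return None
    match = OBJECT_ID.search(text)
    return match.group(1) if match else None

print(extract_object_id("Ref('r1_featuring', ObjectId('5f475')"))  # 5f475
print(extract_object_id("Other('r1_featuring')"))                  # None
```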

Complex regex in Python

I am trying to write a generic pattern using regex so that it fetches only particular things from the string. Let's say we have strings like GigabitEthernet0/0/0/0 or FastEthernet0/4 or Ethernet0/0.222. The regex should fetch the first 2 characters and all the numerals. Therefore, the fetched result should be something like Gi0000 or Fa04 or Et00222 depending on the above cases.
x = 'GigabitEthernet0/0/0/2'
m = re.search('([\w+]{2}?)[\\\.(\d+)]{0,}',x)
I don't understand how I should write the regular expression. The values can also be fetched as a list. I wrote a few more patterns, but they didn't help.
With regex, you can use the re.findall function.
>>> import re
>>> s = 'GigabitEthernet0/0/0/0 '
>>> s[:2]+''.join(re.findall(r'\d', s))
'Gi0000'
OR
>>> ''.join(re.findall(r'^..|\d', s))
'Gi0000'
>>> ''.join(re.findall(r'^..|\d', 'Ethernet0/0.222'))
'Et00222'
OR
>>> s = 'GigabitEthernet0/0/0/0 '
>>> s[:2]+''.join([i for i in s if i.isdigit()])
'Gi0000'
z = "Ethernet0/0.222."
print(z[:2] + "".join(re.findall(r"(\d+)(?=[\d\W]*$)", z)))
You can try this. The lookahead makes sure that only the digits toward the end of the string (those followed by nothing but digits and non-word characters) are used.
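To check an approach against all three interface names from the question at once, the "first two characters plus all digits" idea can be wrapped in a function. This is a minimal sketch; the abbreviate name is hypothetical, and it uses re.sub to delete non-digits rather than re.findall to collect digits:

```python
import re

def abbreviate(interface):
    """First two characters followed by every digit in the name."""
    # re.sub(r'\D+', '', ...) removes every run of non-digit characters.
    return interface[:2] + re.sub(r'\D+', '', interface)

for name in ('GigabitEthernet0/0/0/0', 'FastEthernet0/4', 'Ethernet0/0.222'):
    print(abbreviate(name))
# Gi0000
# Fa04
# Et00222
```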
Here is another option:
s = 'Ethernet0/0.222'
"".join(re.findall(r'^\w{2}|\d+', s))

Regex in Python 2.7 for extraction of information from Snort log files

I'm trying to extract information from a Snort file using regular expressions. I've successfully got the IPs and SID, but I seem to be having trouble extracting a specific part of the text.
How can I extract part of a Snort log file? The part I'm trying to extract can look like [Classification: example-of-attack] or [Classification: Example of Attack]. The first form may have any number of hyphens, whereas the second has no hyphens but contains some capital letters.
How could I extract just example-of-attack or Example-of-Attack?
I unfortunately only know how to search for static words such as:
test = re.search("exact-name", line)
t = test.group()
print t
I've tried many different commands on the web, but I just don't seem to get it.
You can use the following regex:
>>> m = re.search(r'\[Classification:\s*([^]]+)\]', line).group(1)
You could use look-behinds,
>>> s = "[Classification: example-of-attack]"
>>> m = re.search(r'(?<=Classification: )[^\]]*', s)
>>> m
<_sre.SRE_Match object at 0x7ff54a954370>
>>> m.group()
'example-of-attack'
>>> s = "[Classification: Example of Attack]"
>>> m = re.search(r'(?<=Classification: )[^\]]*', s).group()
>>> m
'Example of Attack'
Use the regex module if there may be more than one space after the string Classification: (the built-in re module does not support variable-width look-behinds):
>>> import regex
>>> s = "[Classification: Example of Attack]"
>>> regex.search(r'(?<=Classification:\s+\b)[^\]]*', s).group()
'Example of Attack'
If you want to match any substring with the pattern [Word: Value], you could use the following regex,
ptrn = r"\[\s*(\w+):\s*([\w\s-]+)\s*\]"
Here I've used two groups, one for the first word ("Classification" in your question) and one for the second (either "example-of-attack" or "Example of Attack"). It also requires opening and closing square brackets. For example,
txt1 = '[Classification: example-of-attack]'
m = re.search( ptrn, txt1 )
>>> m.group(2)
'example-of-attack'
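When looping over a log file, some lines may have no Classification field at all, in which case re.search returns None. A minimal sketch that handles this, using only the two bracketed samples from the question plus one made-up non-matching line (the classification function name is hypothetical):

```python
import re

# Capture everything between "Classification:" and the closing bracket.
CLASSIFICATION = re.compile(r'\[Classification:\s*([^\]]+)\]')

def classification(line):
    """Return the classification text, or None if the line has none."""
    match = CLASSIFICATION.search(line)
    return match.group(1).strip() if match else None

lines = [
    '[Classification: example-of-attack]',
    '[Classification: Example of Attack]',
    'a line without that field',  # made-up sample for the no-match case
]
for line in lines:
    print(classification(line))
# example-of-attack
# Example of Attack
# None
```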

how to detect a repeated pattern in a string using re module of python

I'm trying to match a string using the re module of Python, in which a pattern may or may not repeat. The string starts with three alphabetical parts separated by :, then there is a = followed by another alphabetical part. The string can finish here or continue with repeated alphabetical_part=alphabetical_part patterns separated by commas. Both samples are shown below:
Finishes with just one repeat ==> aa:bb:cc=dd
Finishes with more than one repeat ==> aa:bb:cc=dd,ee=ff,gg=hh
As you can see, there can't be a comma at the end of the string. I have written a pattern for matching this:
>>> pt = re.compile(r'\S+:\S+:[\S+=\S+$|,]+')
re.match returns a match object for this, but when I group the repeat pattern, I got something strange, see:
>>> st = 'xx:zz:rr=uu,ii=oo,ff=ee'
>>> pt = re.compile(r'\S+:\S+:([\S+=\S+$|,]+)')
>>> pt.findall(st)
['e']
I'm not sure if I wrote the right pattern or not; how can I check it? If it's wrong, what is the right answer though?
I think you want something like this,
>>> import re
>>> s = """ foo bar bar foo
xx:zz:rr=uu,ii=oo,ff=ee
aa:bb:cc=dd
xx:zz:rr=uu,ii=oo,ff=ee
bar foo"""
>>> m = re.findall(r'^[^=: ]+?[=:](?:[^=:,]+?[:=][^,\n]+?)(?:,[^=:,]+?[=:][^,\n]+?)*$', s, re.M)
>>> m
['xx:zz:rr=uu,ii=oo,ff=ee', 'aa:bb:cc=dd', 'xx:zz:rr=uu,ii=oo,ff=ee']
st = 'xx:zz:rr=uu,ii=oo,ff=ee'
m = re.findall(r'\w+:\w+:(\w+=\w+)((?:,\w+=\w+)*)', st )
>>> m
[('rr=uu', ',ii=oo,ff=ee')]
Don't use \S, because it will also match : (as well as = and ,). It's better to use \w.
Or:
re.findall(r'\w+:\w+:(\w+=\w+(?:,\w+=\w+)*)', st )[0].split(',')
# This will return: ['rr=uu', 'ii=oo', 'ff=ee']
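To check the whole string rather than just find a match somewhere inside it, re.fullmatch can be used, which also rejects a trailing comma. The following is a sketch under the assumption that "alphabetical" means ASCII letters only; the parse_pairs name and the choice to return a dict are illustrative, not from the question:

```python
import re

# Assumption: each "alphabetical part" is one or more ASCII letters.
PAIR = r'[A-Za-z]+=[A-Za-z]+'
LINE = re.compile(r'[A-Za-z]+:[A-Za-z]+:(' + PAIR + r'(?:,' + PAIR + r')*)')

def parse_pairs(text):
    """Return the key=value pairs as a dict, or None if text is malformed."""
    match = LINE.fullmatch(text)
    if match is None:
        return None
    return dict(item.split('=') for item in match.group(1).split(','))

print(parse_pairs('aa:bb:cc=dd'))               # {'cc': 'dd'}
print(parse_pairs('aa:bb:cc=dd,ee=ff,gg=hh'))   # {'cc': 'dd', 'ee': 'ff', 'gg': 'hh'}
print(parse_pairs('aa:bb:cc=dd,'))              # None
```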
Here's a more readable regex that should work for you:
\S+?:\S+?:(?:\S+?=\S+?.)+
It makes use of a non-capturing group (?:...) and the + repetition token to match one or more of the "alphabetical_part=alphabetical_part" pairs.
Example:
>>> import re
>>> text = """ foo bar
... foo bar bar foo
... xx:zz:rr=uu,ii=oo,ff=ee
... aa:bb:cc=dd
... xx:zz:rr=uu,ii=oo,ff=ee
... bar foo """
>>> pat = re.compile(r'\S+?:\S+?:(?:\S+?=\S+?.)+')
>>> re.findall(pat, text)
['xx:zz:rr=uu,ii=oo,ff=ee', 'aa:bb:cc=dd', 'xx:zz:rr=uu,ii=oo,ff=ee']

search patterns with variable gaps in python

I am looking for patterns in a list containing different strings as:
names = ['TAATGH', 'GHHKLL', 'TGTHA', 'ATGTTKKKK', 'KLPPNF']
I would like to select the string that has the pattern 'T--T' (no matter how the string starts), so those elements would be selected and appended to a new list as:
namesSelected = ['TAATGH', 'ATGTTKKKK']
Using grep I could:
grep "T[[:alpha:]]\{2\}T"
Is there something similar in Python's re module?
Thanks for any help!
I think this is most likely what you want:
re.search(r'T[A-Z]{2}T', inputString)
The equivalent in Python for [[:alpha:]] would be [a-zA-Z]. You may replace [A-Z] with [a-zA-Z] in the code snippet above if you wish to allow lowercase alphabet.
Documentation for re.search.
Yep, you can use re.search:
>>> import re
>>> names = ['TAATGH', 'GHHKLL', 'TGTHA', 'ATGTTKKKK', 'KLPPNF']
>>> reslist = []
>>> for i in names:
...     res = re.search(r'T[A-Z]{2}T', i)
...     if res:
...         reslist.append(i)
...
>>> print(reslist)
['TAATGH', 'ATGTTKKKK']
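import re
def grep(l, pattern):
    r = re.compile(pattern)
    # search each string in the list, not the pattern itself
    return [s for s in l if r.search(s)]
nameSelected = grep(names, r"T\w{2}T")
Note the use of \w instead of [[:alpha:]]; unlike [[:alpha:]], \w also matches digits and underscores.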
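A quick way to see the difference between \w and an explicit letter class is to add a value containing digits to the list. In this sketch 'T12T' is an extra test value that is not in the original question:

```python
import re

# 'T12T' is an added test value, not part of the original list.
names = ['TAATGH', 'GHHKLL', 'TGTHA', 'ATGTTKKKK', 'KLPPNF', 'T12T']

letters_only = re.compile(r'T[A-Za-z]{2}T')  # closest to grep's [[:alpha:]] for ASCII
word_chars = re.compile(r'T\w{2}T')          # also accepts digits and underscores

print([n for n in names if letters_only.search(n)])  # ['TAATGH', 'ATGTTKKKK']
print([n for n in names if word_chars.search(n)])    # ['TAATGH', 'ATGTTKKKK', 'T12T']
```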
