How to read access-log hosts with regex?


I have such entries:
e179206120.adsl.alicedsl.de
safecamp-plus-2098.unibw-hamburg.de
p5B30EBFE.dip0.t-ipconnect.de
and I would like to match only the main domain names like
alicedsl.de
unibw-hamburg.de
t-ipconnect.de
I tried this \.\w+\.\w+\.\w{2,3} but that matches .adsl.alicedsl.de

How about [^.]+\.\w+$
Or, in Python:
import re
tgt='''\
e179206120.adsl.alicedsl.de
safecamp-plus-2098.unibw-hamburg.de
p5B30EBFE.dip0.t-ipconnect.de'''
print(re.findall(r'([^.]+\.\w+$)', tgt, re.M | re.S))
# ['alicedsl.de', 'unibw-hamburg.de', 't-ipconnect.de']
Regex explanation:
[^.]+ 1 or more characters EXCEPT a literal .
\. a literal . (the backslash is needed because an unescaped . matches any character)
\w+ 1 or more characters from [a-z], [A-Z], [0-9] and _. A potentially better pattern for ASCII TLDs is [a-zA-Z]+, since all of the older TLDs consist of ASCII letters only. If you want to handle the newer internationalized TLDs, you need a different regex.
$ asserts the end of the line (re.M makes it match at every line end, not just at the end of the string)
You should know that your definition of TLDs is incomplete. For example, this regex approach breaks on legitimate hosts such as bbc.co.uk and many others that include a common second-level domain (SLD). Use a library if you can, for more general applicability. You can also use the Mozilla list of TLDs and SLDs to decide when the host should include two periods.
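To illustrate the point about SLDs, here is a minimal sketch that keeps one extra label when the host ends in a known two-label suffix. The names MULTI_LABEL_SUFFIXES and registered_domain are hypothetical, and the suffix set is a tiny hand-picked sample; a real application would load the full Mozilla list or use a library.

```python
# Hypothetical, hand-picked sample; the real suffix list is far longer.
MULTI_LABEL_SUFFIXES = {"co.uk", "org.uk", "com.au"}

def registered_domain(host):
    """Return the last two labels, or three when the suffix has two labels."""
    labels = host.lower().rstrip(".").split(".")
    if len(labels) >= 3 and ".".join(labels[-2:]) in MULTI_LABEL_SUFFIXES:
        return ".".join(labels[-3:])
    return ".".join(labels[-2:])

print(registered_domain("p5B30EBFE.dip0.t-ipconnect.de"))  # t-ipconnect.de
print(registered_domain("www.bbc.co.uk"))                  # bbc.co.uk
```

The plain two-label regex would return co.uk for the second host, which is why a suffix lookup is needed.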

You could use the following with your given data.
[^.]+\.[^.]+$

If you don't have restrictions on using external libraries, check out the tldextract library:
https://pypi.python.org/pypi/tldextract
import tldextract
for host in ["e179206120.adsl.alicedsl.de", "safecamp-plus-2098.unibw-hamburg.de", "p5B30EBFE.dip0.t-ipconnect.de"]:
    extracted = tldextract.extract(host)
    print(extracted.domain + "." + extracted.suffix)

You actually do not need Regex for this. A list comprehension will be far more efficient:
>>> mystr = """
... e179206120.adsl.alicedsl.de
... safecamp-plus-2098.unibw-hamburg.de
... p5B30EBFE.dip0.t-ipconnect.de
... """
>>> [".".join(line.rsplit(".", 2)[-2:]) for line in mystr.splitlines() if line]
['alicedsl.de', 'unibw-hamburg.de', 't-ipconnect.de']
>>>
Also, if you want it, here is a reference on Python's string methods (it explains str.splitlines, str.rsplit, and str.join).
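For readers who want to see what each of those three methods contributes, the comprehension can be unrolled for a single host. This is only an illustrative walk-through; the variable names are made up for the example.

```python
line = "e179206120.adsl.alicedsl.de"

# rsplit(".", 2) splits from the right, at most twice.
parts = line.rsplit(".", 2)
print(parts)   # ['e179206120.adsl', 'alicedsl', 'de']

# The last two pieces are the domain name and the TLD.
tail = parts[-2:]
print(tail)    # ['alicedsl', 'de']

# join glues them back together with a dot.
domain = ".".join(tail)
print(domain)  # alicedsl.de
```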
If you run a speed test using timeit.timeit, you will see that the list comprehension is much faster:
>>> from timeit import timeit
>>> mystr = """
... e179206120.adsl.alicedsl.de
... safecamp-plus-2098.unibw-hamburg.de
... p5B30EBFE.dip0.t-ipconnect.de
... """
>>> def func():
...     import re
...     re.findall(r'([^.]+\.\w+$)', mystr, re.M | re.S)
...
>>> timeit("func()", "from __main__ import func") # Regex's time
51.85605544838802
>>> def func():
...     [".".join(line.rsplit(".", 2)[-2:]) for line in mystr.splitlines() if line]
...
>>> timeit("func()", "from __main__ import func") # List comp.'s time
12.113929004943316
>>>
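The absolute timings above depend on the machine and the Python version, so they are worth re-measuring. Below is a sketch of one way to rerun the comparison, assuming a precompiled pattern and an explicit number of runs; HOSTS, PATTERN, with_regex and with_split are hypothetical names, and the pattern excludes newlines from the character class as a small precaution.

```python
import re
from timeit import timeit

HOSTS = """\
e179206120.adsl.alicedsl.de
safecamp-plus-2098.unibw-hamburg.de
p5B30EBFE.dip0.t-ipconnect.de
"""

# Compile once so the benchmark measures matching, not pattern lookup.
PATTERN = re.compile(r"[^.\n]+\.\w+$", re.M)

def with_regex():
    return PATTERN.findall(HOSTS)

def with_split():
    return [".".join(line.rsplit(".", 2)[-2:]) for line in HOSTS.splitlines() if line]

# Both approaches should agree before their speed is compared.
assert with_regex() == with_split()

# number is kept small here; timeit's default is one million runs.
print(timeit(with_regex, number=10000))
print(timeit(with_split, number=10000))
```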

Related

how to detect a repeated pattern in a string using re module of python

I'm trying to match a string using Python's re module in which a pattern may or may not repeat. The string starts with three alphabetical parts separated by :, followed by = and another alphabetical part. The string can end there, or continue with further alphabetical_part=alphabetical_part pairs separated by commas. Two samples:
Finishes with just one repeat ==> aa:bb:cc=dd
Finishes with more than one repeat ==> aa:bb:cc=dd,ee=ff,gg=hh
As you see, there can't be a comma at the end of the string. I have written a pattern for matching this:
>>> pt = re.compile(r'\S+:\S+:[\S+=\S+$|,]+')
re.match returns a match object for this, but when I group the repeated pattern, I get something strange:
>>> st = 'xx:zz:rr=uu,ii=oo,ff=ee'
>>> pt = re.compile(r'\S+:\S+:([\S+=\S+$|,]+)')
>>> pt.findall(st)
['e']
I'm not sure if I wrote the right pattern or not; how can I check it? If it's wrong, what is the right answer though?

I think you want something like this,
>>> import re
>>> s = """ foo bar bar foo
xx:zz:rr=uu,ii=oo,ff=ee
aa:bb:cc=dd
xx:zz:rr=uu,ii=oo,ff=ee
bar foo"""
>>> m = re.findall(r'^[^=: ]+?[=:](?:[^=:,]+?[:=][^,\n]+?)(?:,[^=:,]+?[=:][^,\n]+?)*$', s, re.M)
>>> m
['xx:zz:rr=uu,ii=oo,ff=ee', 'aa:bb:cc=dd', 'xx:zz:rr=uu,ii=oo,ff=ee']

>>> st = 'xx:zz:rr=uu,ii=oo,ff=ee'
>>> m = re.findall(r'\w+:\w+:(\w+=\w+)((?:,\w+=\w+)*)', st)
>>> m
[('rr=uu', ',ii=oo,ff=ee')]
Don't use \S, because it also matches : (as well as = and ,). It's better to use \w.
Or:
re.findall(r'\w+:\w+:(\w+=\w+(?:,\w+=\w+)*)', st )[0].split(',')
# This will return: ['rr=uu', 'ii=oo', 'ff=ee']
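Building on the \w-based pattern above, here is a sketch that both validates the whole string and returns the pairs. It assumes Python 3.4+ for re.fullmatch; LINE and parse_pairs are hypothetical names.

```python
import re

# Two word-like parts and a key=value pair joined by ':', optionally
# followed by more key=value pairs separated by commas.
LINE = re.compile(r"\w+:\w+:(\w+=\w+(?:,\w+=\w+)*)")

def parse_pairs(text):
    """Return a list of (key, value) tuples, or None if text does not fit."""
    m = LINE.fullmatch(text)
    if m is None:
        return None
    return [tuple(pair.split("=")) for pair in m.group(1).split(",")]

print(parse_pairs("xx:zz:rr=uu,ii=oo,ff=ee"))  # [('rr', 'uu'), ('ii', 'oo'), ('ff', 'ee')]
print(parse_pairs("aa:bb:cc=dd"))              # [('cc', 'dd')]
print(parse_pairs("aa:bb:cc=dd,"))             # None
```

Using fullmatch instead of findall means a trailing comma makes the whole string fail rather than being silently ignored.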

Here's a more readable regex that should work for you:
\S+?:\S+?:(?:\S+?=\S+?.)+
It makes use of a non-capturing group (?:...) and the plus + repeat token to match on one or more of the "alphabetical_part=alphabetical_part"
Example:
>>> import re
>>> str = """ foo bar
... foo bar bar foo
... xx:zz:rr=uu,ii=oo,ff=ee
... aa:bb:cc=dd
... xx:zz:rr=uu,ii=oo,ff=ee
... bar foo """
>>> pat = re.compile(r'\S+?:\S+?:(?:\S+?=\S+?.)+')
>>> re.findall(pat, str)
['xx:zz:rr=uu,ii=oo,ff=ee', 'aa:bb:cc=dd', 'xx:zz:rr=uu,ii=oo,ff=ee']

Using regex to find a string starting with /team/ and ending with /Euro_2012

Hi simple question...
I want to find all strings that basically match the following pattern:
/team/*/Euro_2012
So it should find:
/team/Croatia/Euro_2012
/team/Netherlands/Euro_2012
But not:
/team/Netherlands/WC2014
How do I write this in Regex for Python using re.compile?

Simple enough:
re.findall(r'/team/.*?/Euro_2012', inputtext)
You may want to limit the permissible characters between /team/ and /Euro_2012 to reduce the chances of false positives in larger text:
re.findall(r'/team/[\w\d%.~+-/]*?/Euro_2012', inputtext)
which only allows for valid URI characters.
Demo:
>>> import re
>>> sample = '''\
... /team/Croatia/Euro_2012
... /team/Netherlands/Euro_2012
... /team/Netherlands/WC2014
... '''
>>> re.findall(r'/team/.*?/Euro_2012', sample)
['/team/Croatia/Euro_2012', '/team/Netherlands/Euro_2012']
>>> re.findall(r'/team/[\w\d%.~+-/]*?/Euro_2012', sample)
['/team/Croatia/Euro_2012', '/team/Netherlands/Euro_2012']
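Since the question asks about re.compile, here is a sketch of a compiled variant. It assumes that exactly one path segment should sit between /team/ and /Euro_2012, which is a stricter reading than the patterns above; EURO_2012 is a hypothetical name and the /team/Spain/squad line is made-up test data.

```python
import re

# [^/\s]+ allows exactly one path segment between /team/ and /Euro_2012.
EURO_2012 = re.compile(r"/team/[^/\s]+/Euro_2012")

sample = """\
/team/Croatia/Euro_2012
/team/Netherlands/Euro_2012
/team/Netherlands/WC2014
/team/Spain/squad/Euro_2012
"""

print(EURO_2012.findall(sample))
# ['/team/Croatia/Euro_2012', '/team/Netherlands/Euro_2012']
```

With .*? the last sample line would also match, because the dot is allowed to run across the extra slash.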

Python regex parsing string containing braced items

So I have a set of strings which look like:
Callable {option-1} {option-2} {option-3} {option-n}
Callable
Callable {option-1}
There may be none or n options.
What I want to do is to parse out the options from this string in a list ([option-1, option-2, option-3, option-n]), or None if there were no braced options. What is the best way of doing it? At present I do lots of split('{') and then strip/clean the output. This feels very ugly.
What is the clean(est) method for doing this?

Use re.findall():
re.findall(r'{([^}]+)}', inputtext)
This pattern matches anything that isn't a closing brace as the option text; alternatively, you can use word characters, digits and dashes:
re.findall(r'{([\w\d-]+)}', inputtext)
Demo:
>>> import re
>>> samples = '''\
... Callable {option-1} {option-2} {option-3} {option-n}
... Callable
... Callable {option-1}
... '''
>>> for line in samples.splitlines():
...     print(re.findall(r'{([^}]+)}', line))
...
['option-1', 'option-2', 'option-3', 'option-n']
[]
['option-1']
This produces lists of matches; no matches results in an empty list.
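The question asked for None when there are no braced options. A small wrapper can turn the empty list into None; parse_options is a hypothetical helper name used only for this sketch.

```python
import re

def parse_options(line):
    """Return the braced options as a list, or None when there are none."""
    options = re.findall(r"\{([^}]+)\}", line)
    return options or None

print(parse_options("Callable {option-1} {option-2}"))  # ['option-1', 'option-2']
print(parse_options("Callable"))                        # None
```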

python: how to remove '$'?

All I want to do is remove the dollar sign '$'. This seems simple, but I really don't know why my code isn't working.
import re
input = '$5'
if '$' in input:
    input = re.sub(re.compile('$'), '', input)
print(input)
Input still is '$5' instead of just '5'! Can anyone help?

Try using replace instead:
input = input.replace('$', '')
As Madbreaks has stated, $ means match the end of the line in a regular expression.
Here is a handy link to regular expressions: http://docs.python.org/2/library/re.html

In this case, I'd use str.translate (the two-argument form below is Python 2):
>>> '$$foo$$'.translate(None,'$')
'foo'
And for benchmarking purposes:
>>> def repl(s):
...     return s.replace('$','')
...
>>> def trans(s):
...     return s.translate(None,'$')
...
>>> import timeit
>>> s = '$$foo bar baz $ qux'
>>> print timeit.timeit('repl(s)','from __main__ import repl,s')
0.969965934753
>>> print timeit.timeit('trans(s)','from __main__ import trans,s')
0.796354055405
There are a number of differences between str.replace and str.translate. The most notable is that str.translate maps single characters to other characters (or deletes them), whereas str.replace replaces one substring with another. So for problems like "I want to delete all the characters a, b and c" or "I want to change a to d", I suggest str.translate. Conversely, problems like "I want to replace the substring abc with def" are well suited to str.replace.
Note that your example doesn't work because $ has special meaning in regex (it matches at the end of a string). To get it to work with regex you need to escape the $:
>>> re.sub(r'\$','',s)
'foo bar baz  qux'
works OK.
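In Python 3, str.translate takes a single translation table, so the two-argument call shown earlier in this answer raises a TypeError there. A sketch of the equivalent, using str.maketrans to build a table that deletes the character (DROP_DOLLAR is a hypothetical name):

```python
# The third argument of str.maketrans lists the characters to delete.
DROP_DOLLAR = str.maketrans("", "", "$")

print("$$foo$$".translate(DROP_DOLLAR))              # foo
print("$$foo bar baz $ qux".translate(DROP_DOLLAR))  # foo bar baz  qux
```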

$ is a special character in regular expressions that means 'end of the string'.
You need to escape it if you want to use it literally.
Try this:
import re
input = "$5"
if "$" in input:
    input = re.sub(re.compile(r'\$'), '', input)
print(input)

You need to escape the dollar sign; otherwise Python treats it as an anchor: http://docs.python.org/2/library/re.html
import re
fred = "$hdkhsd%$"
print(re.sub(r"\$", "!", fred))
>> !hdkhsd%!

Aside from the other answers, you can also use strip():
input = input.strip('$')
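One caveat worth checking before choosing strip(): it only removes characters from the two ends of the string. The sample strings below are made up for illustration.

```python
price = "$5"
mixed = "$1,000 or $2,000"

print(price.strip("$"))        # 5
print(mixed.strip("$"))        # 1,000 or $2,000
print(mixed.replace("$", ""))  # 1,000 or 2,000
```

So strip() is enough for the '$5' case in the question, while replace() is the safer choice when the sign can appear in the middle.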

Python: Regex to find but not include an alphanumeric

Is there a regular expression to find, for example, ">ab" but not include ">" in the result?
I want to replace some strings using re.sub, and I want to find strings starting with ">" without removing the ">".

You want a positive lookbehind assertion. See the docs.
r'(?<=>)ab'
The lookbehind must be a fixed-length expression; it can't match a variable number of characters. Basically, do
r'(?<=stringiwanttobebeforethematch)stringiwanttomatch'
So, an example:
import re
# replace 'ab' with 'e' if it has '>' before it
# here we've got '>ab' so we'll get '>ecd'
print(re.sub(r'(?<=>)ab', 'e', '>abcd'))
# here we've got 'ab' but no '>' so we'll get 'abcd'
print(re.sub(r'(?<=>)ab', 'e', 'abcd'))
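The fixed-length restriction can be seen directly: Python's re module rejects a variable-width lookbehind when the pattern is compiled. A short sketch, where the second pattern is deliberately invalid:

```python
import re

# A fixed-width lookbehind compiles and leaves the '>' in place.
print(re.sub(r"(?<=>)ab", "e", ">abcd"))  # >ecd

# A variable-width lookbehind is rejected at compile time.
try:
    re.compile(r"(?<=>+)ab")
except re.error as exc:
    print("rejected:", exc)
```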

You can use a back reference in sub:
import re
test = """
>word
>word2
don't replace
"""
print(re.sub(r'(>).*', r'\1replace!', test))
Outputs:
>replace!
>replace!
don't replace
I believe this accomplishes what you actually want when you say "I want to replace some strings using re.sub, and I want to find strings starting with '>' without removing the '>'."

If you want to avoid using the re module, you can also use the startswith() string method.
>>> foo = [ '>12', '>54', '34' ]
>>> for line in foo:
...     if line.startswith('>'):
...         line = line.strip('>')
...     print(line)
...
12
54
34
>>>
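Note that strip('>') removes the character from both ends of the string. If only the leading marker should go, slicing is one option; drop_leading_marker is a hypothetical helper and '>7>' is made-up input to show the difference.

```python
def drop_leading_marker(line, marker=">"):
    """Remove marker from the start of line only, leaving any later ones."""
    if line.startswith(marker):
        return line[len(marker):]
    return line

for line in [">12", ">54", "34", ">7>"]:
    print(drop_leading_marker(line))
# 12
# 54
# 34
# 7>
```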
