regex and replace on string using python


I am rather new to Python Regex (regex in general) and I have been encountering a problem. So, I have a few strings like so:
str1 = r'''hfo/gfbi/mytag=a_17014b_82c'''
str2 = r'''/bkyhi/oiukj/game/?mytag=a_17014b_82c&'''
str3 = r'''lkjsd/image/game/mytag=a_17014b_82c$'''
The & and the $ could be any symbol.
I would like to have a single regex (and replace) which replaces:
mytag=a_17014b_82c
to:
mytag=myvalue
from any of the above 3 strings. Would appreciate any guidance on how I can achieve this.
UPDATE: The string to be replaced is not always the same, so a_17014b_82c could be anything in reality.

If the string to be replaced is constant you don't need a regex. Simply use replace:
>>> str1 = r'''hfo/gfbi/mytag=a_17014b_82c'''
>>> str1.replace('a_17014b_82c','myvalue')
'hfo/gfbi/mytag=myvalue'

Use re.sub:
>>> import re
>>> r = re.compile(r'(mytag=)(\w+)')
>>> r.sub(r'\1myvalue', str1)
'hfo/gfbi/mytag=myvalue'
>>> r.sub(r'\1myvalue', str2)
'/bkyhi/oiukj/game/?mytag=myvalue&'
>>> r.sub(r'\1myvalue', str3)
'lkjsd/image/game/mytag=myvalue$'
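The same substitution can be wrapped in a small helper so that the tag name is not hard-coded. This is only a sketch: replace_tag_value is a hypothetical name, and it assumes the value is made of \w characters (letters, digits, underscore), as in the examples above.

```python
import re

def replace_tag_value(text, tag, new_value):
    # re.escape guards against regex metacharacters in the tag name.
    pattern = re.compile(r'(' + re.escape(tag) + r'=)\w+')
    # A function replacement sidesteps backslash handling in new_value.
    return pattern.sub(lambda m: m.group(1) + new_value, text)

print(replace_tag_value('/bkyhi/oiukj/game/?mytag=a_17014b_82c&', 'mytag', 'myvalue'))
# /bkyhi/oiukj/game/?mytag=myvalue&
```

Using a function as the replacement means new_value is inserted literally, even if it contains backslashes.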

import re
r = re.compile(r'(mytag=)\w+(?=\W*$)')
r.sub(r'\1myvalue', str1)
This is based on @Ashwini's answer, with two small changes. First, the mytag=a_17014b_82c part must be at the end of the input, followed at most by non-word characters such as & or $ (that is what the lookahead (?=\W*$) checks), so that even inputs such as
str1 = r'''/bkyhi/mytag=blah/game/?mytag=a_17014b_82c&'''
will work fine, substituting the last mytag instead of the first.
Second, we are not unnecessarily capturing the \w+, since we aren't using it anyway. This is just for a bit of code clarity.

Related

Why is the split() returning list objects that are empty? [duplicate]

I have the following file names that exhibit this pattern:
000014_L_20111007T084734-20111008T023142.txt
000014_U_20111007T084734-20111008T023142.txt
...
I want to extract the middle two time stamp parts after the second underscore '_' and before '.txt'. So I used the following Python regex string split:
time_info = re.split('^[0-9]+_[LU]_|-|\.txt$', f)
But this gives me two extra empty strings in the returned list:
time_info=['', '20111007T084734', '20111008T023142', '']
How do I get only the two time stamp information? i.e. I want:
time_info=['20111007T084734', '20111008T023142']

I'm no Python expert but maybe you could just remove the empty strings from your list?
str_list = re.split('^[0-9]+_[LU]_|-|\.txt$', f)
time_info = filter(None, str_list)
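Note that on Python 3, filter returns a lazy iterator rather than a list. A list comprehension gives the same result on both versions. A minimal sketch, assuming f holds one of the file names from the question:

```python
import re

f = '000014_L_20111007T084734-20111008T023142.txt'
parts = re.split(r'^[0-9]+_[LU]_|-|\.txt$', f)
# Keep only the non-empty pieces produced by the split.
time_info = [p for p in parts if p]
print(time_info)
# ['20111007T084734', '20111008T023142']
```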

Don't use re.split(), use the groups() method of regex Match/SRE_Match objects.
>>> f = '000014_L_20111007T084734-20111008T023142.txt'
>>> time_info = re.search(r'[LU]_(\w+)-(\w+)\.', f).groups()
>>> time_info
('20111007T084734', '20111008T023142')
You can even name the capturing groups and retrieve them in a dict, though you use groupdict() rather than groups() for that. (The regex pattern for such a case would be something like r'[LU]_(?P<groupA>\w+)-(?P<groupB>\w+)\.')
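As a rough illustration of the named-group variant, the sketch below uses start and end as group names. Those names are illustrative only and do not come from the answer above.

```python
import re

f = '000014_L_20111007T084734-20111008T023142.txt'
# "start" and "end" are illustrative group names.
match = re.search(r'[LU]_(?P<start>\w+)-(?P<end>\w+)\.', f)
if match:  # re.search returns None when nothing matches
    info = match.groupdict()
    print(info['start'], info['end'])
```

Checking for None before calling groupdict() avoids an AttributeError on file names that do not follow the pattern.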

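If the timestamps always come after the second _, you can use str.split and str.rsplit. (Avoid strip(".txt") for removing the extension: strip removes any of the characters '.', 't' and 'x' from both ends of the string, not the suffix as a whole.)
>>> strs = "000014_L_20111007T084734-20111008T023142.txt"
>>> strs.rsplit(".", 1)[0].split("_", 2)[-1].split("-")
['20111007T084734', '20111008T023142']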

Since this came up on Google, and for completeness: try re.findall as an alternative.
This does require a little rethinking, but it still returns a list of matches, as split does. That makes it a nice drop-in replacement in some existing code, and it gets rid of the unwanted text. Pair it with lookaheads and/or lookbehinds and you get very similar behavior.
Yes, this is a bit of a "you're asking the wrong question" answer, and it doesn't use re.split(). It does solve the underlying issue: your list of matches suddenly has zero-length strings in it and you don't want that.
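For the file names in the question, a findall version might look like the sketch below. It assumes each timestamp is eight digits, a literal T, then six digits, which matches the two sample names but is not stated in the question.

```python
import re

f = '000014_L_20111007T084734-20111008T023142.txt'
# Assumed timestamp shape: 8 digits, a literal 'T', then 6 digits.
time_info = re.findall(r'\d{8}T\d{6}', f)
print(time_info)
# ['20111007T084734', '20111008T023142']
```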

>>> f='000014_L_20111007T084734-20111008T023142.txt'
>>> f[9:-4].split('-')
['20111007T084734', '20111008T023142']
or, somewhat more general:
>>> f[f.rfind('_')+1:-4].split('-')
['20111007T084734', '20111008T023142']

Complex regex in Python

I am trying to write a generic pattern using regex so that it fetches only particular things from the string. Let's say we have strings like GigabitEthernet0/0/0/0 or FastEthernet0/4 or Ethernet0/0.222. The regex should fetch the first 2 characters and all the numerals. Therefore, the fetched result should be something like Gi0000 or Fa04 or Et00222 depending on the above cases.
x = 'GigabitEthernet0/0/0/2'
m = re.search('([\w+]{2}?)[\\\.(\d+)]{0,}',x)
I don't understand how I should write the regular expression. The values can also be returned as a list. I wrote a few more patterns, but they didn't help.

With regex, you can use the re.findall function.
>>> import re
>>> s = 'GigabitEthernet0/0/0/0 '
>>> s[:2]+''.join(re.findall(r'\d', s))
'Gi0000'
OR
>>> ''.join(re.findall(r'^..|\d', s))
'Gi0000'
>>> ''.join(re.findall(r'^..|\d', 'Ethernet0/0.222'))
'Et00222'
OR
>>> s = 'GigabitEthernet0/0/0/0 '
>>> s[:2]+''.join([i for i in s if i.isdigit()])
'Gi0000'

import re
z = "Ethernet0/0.222."
print(z[:2] + "".join(re.findall(r"(\d+)(?=[\d\W]*$)", z)))
You can try this. The lookahead makes sure that only the digits at the end of the string come into play.

Here is another option:
s = 'Ethernet0/0.222'
"".join(re.findall('^\w{2}|[\d]+', s))
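Any of these approaches can be wrapped in a function and checked against the three interface names from the question. A minimal sketch: abbreviate is a hypothetical helper name, and it assumes the first two characters of the name are never digits (otherwise they would be counted twice).

```python
import re

def abbreviate(name):
    # First two characters of the name, followed by every digit in it.
    return name[:2] + ''.join(re.findall(r'\d', name))

for name in ('GigabitEthernet0/0/0/0', 'FastEthernet0/4', 'Ethernet0/0.222'):
    print(abbreviate(name))
# Gi0000
# Fa04
# Et00222
```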

Using regular expression to extract string

I need to extract the IP address from the following string.
>>> mydns='ec2-54-196-170-182.compute-1.amazonaws.com'
The text to the left of the dot needs to be returned. The following works as expected.
>>> mydns[:18]
'ec2-54-196-170-182'
But it does not work in all cases. For example:
mydns='ec2-666-777-888-999.compute-1.amazonaws.com'
>>> mydns[:18]
'ec2-666-777-888-99'
How do I use regular expressions in Python?

No need for regex... Just use str.split
mydns.split('.', 1)[0]
Demo:
>>> mydns='ec2-666-777-888-999.compute-1.amazonaws.com'
>>> mydns.split('.', 1)[0]
'ec2-666-777-888-999'

If you wanted to use regex for this:
Regex String
ec2-([0-9]{1,3})-([0-9]{1,3})-([0-9]{1,3})-([0-9]{1,3}).*
Alternative (EC2 Agnostic):
.*\b([0-9]{1,3})-([0-9]{1,3})-([0-9]{1,3})-([0-9]{1,3}).*
Replacement String
Regular: \1.\2.\3.\4
Reverse: \4.\3.\2.\1
Python code
import re
subject = 'ec2-54-196-170-182.compute-1.amazonaws.com'
result = re.sub("ec2-([0-9]{1,3})-([0-9]{1,3})-([0-9]{1,3})-([0-9]{1,3}).*", r"\1.\2.\3.\4", subject)
print result
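An alternative to re.sub is to match once and join the captured groups, which also makes the no-match case explicit. This is a sketch under assumptions: dns_to_ip is a hypothetical helper name, and the pattern only checks for one to three digits per part, so it does not validate that each part is at most 255.

```python
import re

def dns_to_ip(hostname):
    # Four dash-separated groups of 1-3 digits after the "ec2-" prefix.
    match = re.match(r'ec2-(\d{1,3})-(\d{1,3})-(\d{1,3})-(\d{1,3})\.', hostname)
    if match is None:
        return None
    return '.'.join(match.groups())

print(dns_to_ip('ec2-54-196-170-182.compute-1.amazonaws.com'))
# 54.196.170.182
```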

This regex will match everything before the first dot: ^[^.]+
So try this:
import re
string = "ec2-54-196-170-182.compute-1.amazonaws.com"
ip = re.findall('^[^.]+',string)[0]
print ip
Output:
ec2-54-196-170-182
The best thing is that this will match even if the instance prefix is ec2, ec3, and so on, so this regex does much the same as @mgilson's code.

python: how to remove '$'?

All I want to do is remove the dollar sign '$'. This seems simple, but I really don't know why my code isn't working.
import re
input = '$5'
if '$' in input:
    input = re.sub(re.compile('$'), '', input)
print input
Input still is '$5' instead of just '5'! Can anyone help?

Try using replace instead:
input = input.replace('$', '')
As Madbreaks has stated, $ means match the end of the line in a regular expression.
Here is a handy link to regular expressions: http://docs.python.org/2/library/re.html
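To see the difference concretely, the sketch below compares the unescaped pattern with one built by re.escape, which is a standard way to turn literal text into a safe pattern. The variable names are illustrative.

```python
import re

text = '$5'
# Unescaped, '$' is an anchor: it matches the empty string at the end,
# so nothing visible is removed.
unescaped = re.sub('$', '', text)
# re.escape turns the literal text into a safe pattern.
escaped = re.sub(re.escape('$'), '', text)
print(unescaped)  # $5
print(escaped)    # 5
```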

In this case, I'd use str.translate
>>> '$$foo$$'.translate(None,'$')
'foo'
And for benchmarking purposes:
>>> def repl(s):
...     return s.replace('$','')
...
>>> def trans(s):
...     return s.translate(None,'$')
...
>>> import timeit
>>> s = '$$foo bar baz $ qux'
>>> print timeit.timeit('repl(s)','from __main__ import repl,s')
0.969965934753
>>> print timeit.timeit('trans(s)','from __main__ import trans,s')
0.796354055405
There are a number of differences between str.replace and str.translate. The most notable is that str.translate works on single characters (swapping one character for another, or deleting it), whereas str.replace replaces one substring with another. So, for problems like "I want to delete all the characters a, b, c" or "I want to change a to d", I suggest str.translate. Conversely, problems like "I want to replace the substring abc with def" are well suited to str.replace.
Note that your example doesn't work because $ has special meaning in regex (it matches at the end of a string). To get it to work with regex you need to escape the $:
>>> re.sub(r'\$','',s)
'foo bar baz  qux'
works OK.
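The translate(None, '$') form shown above is Python 2 syntax. On Python 3, str.translate takes a single table argument, which can be built with str.maketrans. A minimal sketch:

```python
s = '$$foo bar baz $ qux'
# The third argument of str.maketrans lists the characters to delete.
table = str.maketrans('', '', '$')
result = s.translate(table)
print(result)
```

The space on each side of the removed $ is kept, so the result has two spaces between baz and qux.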

$ is a special character in regular expressions that means 'end of the string'.
You need to escape it if you want to use it literally.
Try this:
import re
input = "$5"
if "$" in input:
    input = re.sub(re.compile(r'\$'), '', input)
print(input)

You need to escape the dollar sign; otherwise Python treats it as an end-of-string anchor: http://docs.python.org/2/library/re.html
import re
fred = "$hdkhsd%$"
print re.sub ("\$","!", fred)
>> !hdkhsd%!

Aside from the other answers, you can also use strip(), which removes the character only from the beginning and end of the string:
input = input.strip('$')

Python regular words cut

I have string: './money.log_rotated.27.10.2011_17:15:01:[27-10-2011 17:07:02]'
I need string: '27-10-2011 17:07:02'
How can I do this in Python?

There are many ways to do this, one way is to use str.partition:
text='./money.log_rotated.27.10.2011_17:15:01:[27-10-2011 17:07:02]'
before,_,after = text.partition('[')
print(after[:-1])
# 27-10-2011 17:07:02
Another is to use str.split:
before,after = text.split('[',1)
print(after[:-1])
# 27-10-2011 17:07:02
or str.find and str.rfind:
ind1 = text.find('[')+1
ind2 = text.rfind(']')
print(text[ind1:ind2])
All these methods rely on the desired substring immediately following the first left-bracket [.
The first two methods also rely on the desired substring ending at the next-to-last character in text. The last method (using rfind) searches from the right for the index of the right-bracket, so it is a little more general, and does not depend on quite so many (potential off-by-one) constants.
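The find/rfind variant can be made a little more defensive by checking for missing brackets. This is a sketch only; between_brackets is a hypothetical helper name, and returning None for malformed input is an assumed convention.

```python
def between_brackets(text):
    start = text.find('[')
    end = text.rfind(']')
    # find and rfind return -1 when the bracket is missing.
    if start == -1 or end == -1 or end < start:
        return None
    return text[start + 1:end]

text = './money.log_rotated.27.10.2011_17:15:01:[27-10-2011 17:07:02]'
print(between_brackets(text))
# 27-10-2011 17:07:02
```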

If your string always has the same structure, this is probably the simplest solution:
s = r'./money.log_rotated.27.10.2011_17:15:01:[27-10-2011 17:07:02]'
s[s.find("[")+1:s.find("]")]
Update:
After seeing some of the other answers this is a slight improvement:
s[s.find("[")+1:-1]
Exploiting the fact that the closing square bracket is the last character in your string.

If the format is "fixed", you can also use this
>>> s = './money.log_rotated.27.10.2011_17:15:01:[27-10-2011 17:07:02]'
>>> s[-20:-1:]
'27-10-2011 17:07:02'
>>>

You can also use regular expression:
import re
s = './money.log_rotated.27.10.2011_17:15:01:[27-10-2011 17:07:02]'
print re.search(r'\[(.*?)\]', s).group(1)

Try with a regex :
import re
re.findall(".*\[(.*)\]", './money.log_rotated.27.10.2011_17:15:01:[27-10-2011 17:07:02]')
>>> ['27-10-2011 17:07:02']

Probably the easiest way (if you know the string will always be in this format):
>>> s = './money.log_rotated.27.10.2011_17:15:01:[27-10-2011 17:07:02]'
>>> s[s.index('[') + 1:-1]
'27-10-2011 17:07:02'
