search substrings and replace them after analyzing - python

I've got multiple occurences of strings like this in my file:
%na^me%
%name^%
%^name%
....
I want to search every string like this in my file and replace it after analyzing the string.
For example
string `%^name%` will be replaced `Data`
string `%name^%` will be replaced with `DATA`
....
To find my substring I use this function with regex
re.findall('(?<=\%)(.*?)(?=\%)', data)
It finds substrings fine. But how to replace them?
The solution I see is to create map and iterate over to replace everyoccurrence with some value.
But is there a better way?

You don't need to go for re.findall. Just re.sub would be fine.
>>> s = '''%na^me%
%name^%
%^name%'''
>>> m = re.sub(r'(?<=%)\^.*?(?=%)', r'Data', s)
>>> f = re.sub(r'(?<=%).*?\^(?=%)', r'DATA', m)
>>> print(f)
%na^me%
%DATA%
%Data%
Update:
>>> m = re.sub(r'(?<=%)\^.*?(?=%)', r'Data', s)
>>> f = re.sub(r'(?<=%).*?\^(?=%)', r'DATA', m)
>>> j = re.sub(r'(?<=%).*?.\^..*(?=%)', r'datA', f)
>>> print(j)
%datA%
%DATA%
%Data%
If you want to replace % also, you could try this,
>>> m = re.sub(r'%\^.*?%', r'Data', s)
>>> f = re.sub(r'%.*?\^%', r'DATA', m)
>>> j = re.sub(r'%.*?.\^..*%', r'datA', f)
>>> print(j)
datA
DATA
Data

You can use the following pattern , note that you need to escape ^ with \. and instead look-around you can use grouping Also you need to use r before your pattern if you want to python interpret you pattern as regex format :
>>> s="""%na^me%
... %name^%
... %^name%"""
>>> l= re.findall(r'%([a-zA-Z\^]+)%',s)
['na^me', 'name^', '^name']
and for replace the string you can use a dictionary like the following , and replace your pattern with str.replace():
>>> d={'^name':'Data','name^':'DATA','na^me':'DAta'}
>>> for i in l :
... s=s.replace(i,d[i])
...
>>> s
'%DAta%\n%DATA%\n%Data%'

Related

python regular expression, pulling all letters out

Is there a better way to pull A and F from this: A13:F20
a="A13:F20"
import re
pattern = re.compile(r'\D+\d+\D+')
matches = re.search(pattern, a)
num = matches.group(0)
print num[0]
print num[len(num)-1]
output
A
F
note: the digits are of unknown length
You don't have to use regular expressions, or re at all. Assuming you want just letters to remain, you could do something like this:
a = "A13:F20"
a = filter(lambda x: x.isalpha(), a)
I'd do it like this:
>>> re.findall(r'[a-z]', a, re.IGNORECASE)
['A', 'F']
Use a simple list comprehension, as a filter and get only the alphabets from the actual string.
print [char for char in input_string if char.isalpha()]
# ['A', 'F']
You could use re.sub:
>>> a="A13.F20"
>>> re.sub(r'[^A-Z]', '', a) # Remove everything apart from A-Z
'AF'
>>> re.sub(r'[A-Z]', '', a) # Remove A-Z
'13.20'
>>>
If you're working with strings that all have the same format, you can just cut out substrings:
a="A13:F20"
print a[0], a[4]
More on python slicing in this answer:
Is there a way to substring a string in Python?

Split string into tuple (Upper,lower) 'ABCDefgh' . Python 2.7.6

my_string = 'ABCDefgh'
desired = ('ABCD','efgh')
the only way I can think of doing this is creating a for loop and then scanning through and checking each element in the string individually and adding to string and then creating the tuple . . . is there a more efficient way to do this?
it will always be in the format UPPERlower
print re.split("([A-Z]+)",my_string)[1:]
Simple way (two passes):
>>> import itertools
>>> my_string = 'ABCDefgh'
>>> desired = (''.join(itertools.takewhile(lambda c:c.isupper(), my_string)), ''.join(itertools.dropwhile(lambda c:c.isupper(), my_string)))
>>> desired
('ABCD', 'efgh')
Efficient way (one pass):
>>> my_string = 'ABCDefgh'
>>> uppers = []
>>> done = False
>>> i = 0
>>> while not done:
... c = my_string[i]
... if c.isupper():
... uppers.append(c)
... i += 1
... else:
... done = True
...
>>> lowers = my_string[i:]
>>> desired = (''.join(uppers), lowers)
>>> desired
('ABCD', 'efgh')
Because I throw itertools.groupby at everything:
>>> my_string = 'ABCDefgh'
>>> from itertools import groupby
>>> [''.join(g) for k,g in groupby(my_string, str.isupper)]
['ABCD', 'efgh']
(A little overpowered here, but scales up to more complicated problems nicely.)
my_string='ABCDefg'
import re
desired = (re.search('[A-Z]+',my_string).group(0),re.search('[a-z]+',my_string).group(0))
print desired
A more robust approach without using re
import string
>>> txt = "ABCeUiioualfjNLkdD"
>>> tup = (''.join([char for char in txt if char in string.ascii_uppercase]),
''.join([char for char in txt if char not in string.ascii_uppercase]))
>>> tup
('ABCUNLD', 'eiioualfjkd')
the char not in string.ascii_uppercase instead of char in string.ascii_lowercase means that you'll never lose any data in case your string has non-letters in it, which could be useful if you suddenly start having errors when this input starts being rejected 20 function calls later.

How do I split the following string?

I have the following string where I need to extract only the first digits from it.
string = '50.2000\xc2\xb0 E'
How do I extract 50.2000 from string?
If the number can be followed by any kind of character, try using a regex:
>>> import re
>>> r = re.compile(r'(\d+\.\d+)')
>>> r.match('50.2000\xc2\xb0 E').group(1)
'50.2000'
mystring = '50.2000\xc2\xb0 E'
print mystring.split("\xc2", 1)[0]
Output
50.2000
If you just wanted to split the first digits, just slice the string:
start = 10 #start at the 10th digit
print mystring[start:]
Demo:
>>> my_string = 'abcasdkljf23u109842398470ujw{}{\\][\\['
>>> start = 10
>>> print(my_string[start:])
23u109842398470ujw{}{\][\[
You can, split the string at the first \:
>>> s = r'50.2000\xc2\xb0 E'
>>> s.split('\\', 1)
['50.2000', 'xc2\\xb0 E']
You could solve this using a regular expression:
In [1]: import re
In [2]: string = '50.2000\xc2\xb0 E'
In [3]: m = re.match('^([0-9]+\.?[0-9]*)', string)
In [4]: m.group(0)
Out[4]: '50.2000'

Python Regular expression repeat

I have a string like this
--x123-09827--x456-9908872--x789-267504
I am trying to get all value like
123:09827
456:9908872
789:267504
I've tried (--x([0-9]+)-([0-9])+)+
but it only gives me last pair result, I am testing it through python
>>> import re
>>> x = "--x123-09827--x456-9908872--x789-267504"
>>> p = "(--x([0-9]+)-([0-9]+))+"
>>> re.match(p,x)
>>> re.match(p,x).groups()
('--x789-267504', '789', '267504')
How should I write with nested repeat pattern?
Thanks a lot!
David
Code it like this:
x = "--x123-09827--x456-9908872--x789-267504"
p = "--x(?:[0-9]+)-(?:[0-9]+)"
print re.findall(p,x)
Just use the .findall method instead, it makes the expression simpler.
>>> import re
>>> x = "--x123-09827--x456-9908872--x789-267504"
>>> r = re.compile(r"--x(\d+)-(\d+)")
>>> r.findall(x)
[('123', '09827'), ('456', '9908872'), ('789', '267504')]
You can also use .finditer which might be helpful for longer strings.
>>> [m.groups() for m in r.finditer(x)]
[('123', '09827'), ('456', '9908872'), ('789', '267504')]
Use re.finditer or re.findall. Then you don't need the extra pair of parentheses that wrap the entire expression. For example,
>>> import re
>>> x = "--x123-09827--x456-9908872--x789-267504"
>>> p = "--x([0-9]+)-([0-9]+)"
>>> for m in re.finditer(p,x):
>>> print '{0} {1}'.format(m.group(1),m.group(2))
try this
p='--x([0-9]+)-([0-9]+)'
re.findall(p,x)
No need to use regex :
>>> "--x123-09827--x456-9908872--x789-267504".replace('--x',' ').replace('-',':').strip()
'123:09827 456:9908872 789:267504'
You don't need regular expressions for this. Here is a simple one-liner, non-regex solution:
>>> input = "--x123-09827--x456-9908872--x789-267504"
>>> [ x.replace("-", ":") for x in input.split("--x")[1:] ]
['123:09827', '456:9908872', '789:267504']
If this is an exercise on regex, here is a solution that uses the repetition (technically), though the findall(...) solution may be preferred:
>>> import re
>>> input = "--x123-09827--x456-9908872--x789-267504"
>>> regex = '--x(.+)'
>>> [ x.replace("-", ":") for x in re.match(regex*3, input).groups() ]
['123:09827', '456:9908872', '789:267504']

How do I coalesce a sequence of identical characters into just one?

Suppose I have this:
My---sun--is------very-big---.
I want to replace all multiple hyphens with just one hyphen.
import re
astr='My---sun--is------very-big---.'
print(re.sub('-+','-',astr))
# My-sun-is-very-big-.
If you want to replace any run of consecutive characters, you can use
>>> import re
>>> a = "AA---BC++++DDDD-EE$$$$FF"
>>> print(re.sub(r"(.)\1+",r"\1",a))
A-BC+D-E$F
If you only want to coalesce non-word-characters, use
>>> print(re.sub(r"(\W)\1+",r"\1",a))
AA-BC+DDDD-EE$FF
If it's really just hyphens, I recommend unutbu's solution.
If you really only want to coalesce hyphens, use the other suggestions. Otherwise you can write your own function, something like this:
>>> def coalesce(x):
... n = []
... for c in x:
... if not n or c != n[-1]:
... n.append(c)
... return ''.join(n)
...
>>> coalesce('My---sun--is------very-big---.')
'My-sun-is-very-big-.'
>>> coalesce('aaabbbccc')
'abc'
As usual, there's a nice itertools solution, using groupby:
>>> from itertools import groupby
>>> s = 'aaaaa----bbb-----cccc----d-d-d'
>>> ''.join(key for key, group in groupby(s))
'a-b-c-d-d-d'
How about:
>>> import re
>>> re.sub("-+", "-", "My---sun--is------very-big---.")
'My-sun-is-very-big-.'
the regular expression "-+" will look for 1 or more "-".
re.sub('-+', '-', "My---sun--is------very-big---")
How about an alternate without the re module:
'-'.join(filter(lambda w: len(w) > 0, 'My---sun--is------very-big---.'.split("-")))
Or going with Tim and FogleBird's previous suggestion, here's a more general method:
def coalesce_factory(x):
return lambda sent: x.join(filter(lambda w: len(w) > 0, sent.split(x)))
hyphen_coalesce = coalesce_factory("-")
hyphen_coalesce('My---sun--is------very-big---.')
Though personally, I would use the re module first :)
mcpeterson
Another simple solution is the String object's replace function.
while '--' in astr:
astr = astr.replace('--','-')
if you don't want to use regular expressions:
my_string = my_string.split('-')
my_string = filter(None, my_string)
my_string = '-'.join(my_string)
I have
my_str = 'a, b,,,,, c, , , d'
I want
'a,b,c,d'
compress all the blanks (the "replace" bit), then split on the comma, then if not None join with a comma in between:
my_str_2 = ','.join([i for i in my_str.replace(" ", "").split(',') if i])

Categories

Resources