For example, if the string is "abbacdeffel" and the pattern "xyyx" is to be replaced with "1234",
the result would go from "abbacdeffel" to "1234cd1234l".
I have tried to think this through but couldn't come up with anything. At first I thought a dictionary might help, but nothing came to mind.
What you're looking to do can be accomplished with regex, more formally known as regular expressions. Regular expressions let you extract exactly the part of a string that you want.
In your case, you want to match substrings shaped like abba, so use the following regex:
(\w)(\w)\2\1
https://regex101.com/r/hP8lA3/1
This captures two word characters and then uses backreferences to require the second captured character again, followed by the first captured character.
So implementing this in python code looks like this:
First, import the regex module in python
import re
Then, declare your variable
text = "abbacdeffel"
re.finditer returns an iterator of match objects, so you can loop over all the matches
matches = re.finditer(r"(\w)(\w)\2\1", text)
Go through all the matches that the regexp found and replace each matched substring with "1234"
for match in matches:
    text = text.replace(match.group(0), "1234")
For debugging:
print(text)
Complete Code:
import re
text = "abbacdeffel"
matches = re.finditer(r"(\w)(\w)\2\1", text)
for match in matches:
    text = text.replace(match.group(0), "1234")
print(text)
You can learn more about Regular Expressions here: https://regexone.com/references/python
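As a hedged aside, the loop above can probably be collapsed into a single re.sub call, which replaces every non-overlapping match in one pass. This is only a minimal sketch using the same pattern. Note that (\w)(\w)\2\1 also accepts four identical characters such as "aaaa".

```python
import re

text = "abbacdeffel"

# re.sub scans left to right and replaces each non-overlapping match.
result = re.sub(r"(\w)(\w)\2\1", "1234", text)
print(result)

# The same pattern does not require the two characters to differ.
same_chars = re.sub(r"(\w)(\w)\2\1", "1234", "aaaa")
print(same_chars)
```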
New version of code (there was a bug):
def replace_with_pattern(pattern, line, replace):
    from collections import OrderedDict
    set_of_chars_in_pattern = set(pattern)
    indice_start_pattern = 0
    output_line = ""
    while indice_start_pattern < len(line):
        potential_end_pattern = indice_start_pattern + len(pattern)
        subline = line[indice_start_pattern:potential_end_pattern]
        print(subline)
        set_of_chars_in_subline = set(subline)
        if len(set_of_chars_in_subline) != len(set_of_chars_in_pattern):
            output_line += line[indice_start_pattern]
            indice_start_pattern += 1
            continue
        map_of_chars = OrderedDict()
        liste_of_chars_in_pattern = []
        for char in pattern:
            if char not in liste_of_chars_in_pattern:
                liste_of_chars_in_pattern.append(char)
        print(liste_of_chars_in_pattern)
        for subline_char in subline:
            if subline_char not in map_of_chars.values():
                map_of_chars[liste_of_chars_in_pattern.pop(0)] = subline_char
        print(map_of_chars)
        wanted_subline = ""
        for char_of_pattern in pattern:
            wanted_subline += map_of_chars[char_of_pattern]
        print("wanted_subline =" + wanted_subline)
        if subline == wanted_subline:
            output_line += replace
            indice_start_pattern += len(pattern)
        else:
            output_line += line[indice_start_pattern]
            indice_start_pattern += 1
    return output_line
Some tests:
test1 = replace_with_pattern("xyyx", "abbacdeffel", "1234")
test2 = replace_with_pattern("abbacdeffel", "abbacdeffel", "1234")
print(test1, test2)
=> 1234cd1234l 1234
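The idea behind replace_with_pattern (compare each window of the line with the "shape" of the pattern) can be sketched more compactly. The following is an illustrative sketch, not the answer's own code: the helper names signature and replace_by_shape are hypothetical. Each string is reduced to a tuple recording the order in which its distinct characters first appear, so "xyyx" and "abba" both become (0, 1, 1, 0).

```python
def signature(s):
    """Map each character to the rank of its first appearance, e.g. 'abba' -> (0, 1, 1, 0)."""
    first_seen = {}
    return tuple(first_seen.setdefault(ch, len(first_seen)) for ch in s)


def replace_by_shape(pattern, line, replacement):
    """Replace every window of `line` whose shape equals the shape of `pattern`."""
    target = signature(pattern)
    width = len(pattern)
    out = []
    i = 0
    while i < len(line):
        window = line[i:i + width]
        if len(window) == width and signature(window) == target:
            out.append(replacement)
            i += width          # skip the whole matched window
        else:
            out.append(line[i])
            i += 1              # no match here, copy one character
    return "".join(out)


print(replace_by_shape("xyyx", "abbacdeffel", "1234"))
```

Because different pattern letters map to different ranks, this sketch (like the function above) does not treat "aaaa" as a match for "xyyx".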
Here goes my attempt:
([a-zA-Z])(?!\1)([a-zA-Z])\2\1
Assuming you want to match letters only (for other ranges, change both [a-zA-Z] as appropriate), we have:
([a-zA-Z])
Find the first character, and note it so we can later refer to it with \1.
(?!\1)
Check to see if the next character is not the same as the first, but without advancing the search pointer. This is to prevent aaaa being accepted. If aaaa is OK, just remove this subexpression.
([a-zA-Z])
Find the second character, and note it so we can later refer to it with \2.
\2\1
Now find the second again, then the first again, so we match the full abba pattern.
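The step-by-step breakdown above can also be written directly into the pattern with re.VERBOSE, which ignores whitespace in the pattern and allows comments. This is a small sketch of the same regex; the name ABBA is only chosen for illustration.

```python
import re

ABBA = re.compile(r"""
    ([a-zA-Z])   # first character, captured as group 1
    (?!\1)       # the next character must differ from group 1
    ([a-zA-Z])   # second character, captured as group 2
    \2           # the second character again
    \1           # the first character again
""", re.VERBOSE)

# Collect the full text of every match.
found = [m.group(0) for m in ABBA.finditer("abbacdeffelzzzz")]
print(found)
```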
And finally, to do a replace operation, the full command would be:
import re
re.sub(r'([a-zA-Z])(?!\1)([a-zA-Z])\2\1',
'1234',
'abbacdeffelzzzz')
The r at the start of the regex pattern is to prevent Python processing the backslashes. Without it, you would need to do:
import re
re.sub('([a-zA-Z])(?!\\1)([a-zA-Z])\\2\\1',
'1234',
'abbacdeffelzzzz')
Now, I see the spec has expanded to a user-defined pattern; here is some code that will build that pattern:
import re
def make_re(pattern, charset):
    result = ''
    seen = []
    for c in pattern:
        # Is this a letter we've seen before?
        if c in seen:
            # Yes, so we want to match the captured pattern
            result += '\\' + str(seen.index(c) + 1)
        else:
            # No, so match a new character from the charset,
            # but first exclude already matched characters
            for i in range(len(seen)):
                result += '(?!\\' + str(i + 1) + ')'
            result += '(' + charset + ')'
            # Note we have seen this letter
            seen.append(c)
    return result
print(re.sub(make_re('xzzx', '\\d'), 'abba', 'abba1221b99999889'))
print(re.sub(make_re('xyzxyz', '[a-z]'), '123123', 'abcabc zyxzyyx zyzzyz'))
Outputs:
abbaabbab9999abba
123123 zyxzyyx zyzzyz
I have been using some code to remove a parenthesized substring, which goes like this:
def alkylrem(j):
    removed = ''
    paren_level = 0
    for char in j:
        if char == '(':
            paren_level += 1
        elif (char == ')') and paren_level:
            paren_level -= 1
        elif not paren_level:
            removed += char
    return removed
But I find this inadequate since I also need to retrieve the parenthesized substring and maybe store it in a global variable. Can I add an 'elif' statement to do that, or is there a better way?
You could use one regular expression pattern to extract the text inside the parentheses and a different pattern to remove it.
import re
j = "some sentence (with) parnethesis"
def alkylrem(j):
    pat = re.compile(r'\((.*?)\)')  # pattern for getting the contents
    pat2 = re.compile(r'(?<=\().*?(?=\))')  # pattern for removing the contents
    contents = pat.findall(j)  # finds whatever is in the parentheses
    contents_remove = ''.join(pat2.split(j))  # removes contents
    return contents[0], contents_remove  # return both
print(alkylrem(j))
Output:
('with', 'some sentence () parnethesis')
If you aren't familiar with regex, another method would be to simply find the indices of the opening and closing parentheses and reconstruct the string based on those.
def alkylrem(j):
    opener = j.index('(')
    closer = j.index(')')
    contents = j[opener+1:closer]
    removed = j[:opener+1] + j[closer:]
    return contents, removed
Neither solution would work if you have nested parentheses.
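For nested parentheses, one possible approach is to extend the depth counter from the question so that it collects what it skips. This is a hedged sketch, not a tested drop-in replacement: the function name split_parenthesized is hypothetical, it returns the collected groups instead of storing them in a global, and it assumes that text inside an unclosed "(" may simply be dropped.

```python
def split_parenthesized(text):
    """Return (text without top-level parenthesized parts, list of those parts)."""
    kept = []       # characters outside any parentheses
    groups = []     # contents of each top-level parenthesized group
    current = []    # contents of the group being read
    depth = 0
    for char in text:
        if char == '(':
            if depth > 0:
                current.append(char)   # nested '(' belongs to the group
            depth += 1
        elif char == ')' and depth:
            depth -= 1
            if depth == 0:
                groups.append(''.join(current))
                current = []
            else:
                current.append(char)   # nested ')' belongs to the group
        elif depth:
            current.append(char)
        else:
            kept.append(char)          # includes an unmatched ')', as in the original loop
    return ''.join(kept), groups


print(split_parenthesized("CH3(CH2(CH3))CH2(OH)"))
```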
I am massaging strings so that the 1st letter of the string and the first letter following either a dash or a slash needs to be capitalized.
So the following string:
test/string - this is a test string
Should look like so:
Test/String - This is a test string
My first idea for solving this seems like a bad one: iterate over the string, check every character, and use indexing etc. to determine whether a character follows a dash or slash; if it does, uppercase it and write it out to my new string.
def correct_sentence_case(test_phrase):
    corrected_test_phrase = ''
    firstLetter = True
    for char in test_phrase:
        if firstLetter:
            corrected_test_phrase += char.upper()
            firstLetter = False
        #elif char == '/':
        else:
            corrected_test_phrase += char
This just seems VERY un-pythonic. What is a pythonic way to handle this?
Something along the lines of the following would be awesome but I can't pass in both a dash and a slash to the split:
corrected_test_phrase = ' - '.join(i.capitalize() for i in test_phrase.split(' - '))
Which I got from this SO:
Convert UPPERCASE string to sentence case in Python
Any help will be appreciated :)
I was able to accomplish the desired transformation with a regular expression:
import re
capitalized = re.sub(
    r'(^|[-/])\s*([A-Za-z])', lambda match: match[0].upper(), phrase)
The expression says: "wherever the start of the string (^) or a dash or slash is followed by optional whitespace and a letter, replace the match with its uppercase form." Uppercasing the whole match is safe because the separator and the whitespace have no uppercase variant.
If you don't want to go with a messy splitting-joining logic, go with a regex:
import re
string = 'test/string - this is a test string'
print(re.sub(r'(^([a-z])|(?<=[-/])\s?([a-z]))',
lambda match: match.group(1).upper(), string))
# Test/String - This is a test string
Using a double split (split on ' - ' first and on '/' second, so that capitalize() does not lowercase a part that was already capitalized):
' - '.join(['/'.join([w.capitalize() for w in part.split('/')]) for part in test_phrase.split(' - ')])
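On the question's wish to split on both a dash and a slash at once: re.split accepts a character class, and wrapping the separator in a capturing group keeps the separators in the result, so the phrase can be reassembled unchanged. A minimal sketch, with a hypothetical function name; unlike capitalize(), it leaves the rest of each part untouched.

```python
import re


def capitalize_after_separators(phrase):
    # The capturing group makes re.split return the separators as list items too.
    parts = re.split(r'(\s*[-/]\s*)', phrase)
    # Uppercase only the first character of each item; separators are unaffected.
    return ''.join(part[:1].upper() + part[1:] for part in parts)


print(capitalize_after_separators('test/string - this is a test string'))
```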
I'm using this:
import string
last = 'pierre-GARCIA'
if last not in [None, '']:
    last = last.strip()
    if '-' in last:
        last = string.capwords(last, sep='-')
    else:
        last = string.capwords(last, sep=None)
I am defining a regex to match my defined identifiers - an identifier has to start with a letter followed by any number of letters, numbers, and underscores.
I have my current regex r'[A-Za-z][A-Za-z0-9_]*' and it works great except for cases like this: if I send in: testid#entifier_, it returns a match for testid and entifier_. I want it to completely reject the identifier. Not match parts of it.
It just ends up splitting them.
What can I do without using a complex look-ahead for legal chars?
Input is simply:
arg = sys.argv[1]
file = open(arg)
inLines = file.read()
file.close()
tokens = lexer(inLines, tokenFormats)
A sample of my defined regexes looks like this:
tokenFormats = [
(r'[\s\n\t]+', None), #Whitespace
(r'\/\*(\*(?!\/)|[^*])*\*\/', None), #Comment
(r'\(', LParent),
(r'\)', RParent),
(r'\[', LBracket),
(r'\]', RBracket),
(r'\{', LBrace),
(r'\}', RBrace),
(r'\,', CommaT),
(r'(?<="{1}).*?(?=")', STRLITERAL),
(r'\"', QuoteT),
(r'\.', PeriodT),
(r'\-?[0-9]*\.[0-9]+', ValueR),
(r'\+', AddT),
(r'-', AddT),
(r'\|\|', AddT),
(r';', Semicolon),
My matching loop is like this:
def lexer(input, tokenFormats):
    pos = 0
    tokens = []
    while pos < len(input):
        match = None
        for tokenFormat in tokenFormats:
            pattern, tag = tokenFormat
            regex = re.compile(pattern)
            match = regex.match(input, pos)  # Essentially Build Lexeme
            if match:
                lexeme = match.group(0)
                if tag:
                    if tag == Identifier and len(str(lexeme)) > 27:  # rough fix to check length. Very hacky
                        sys.stderr.write('Illegal length for identifier: %s\n' % lexeme)
                        break
                    attr = checkForAttribute(lexeme, tag)
                    token = (lexeme, tag, attr)
                    tokens.append(token)
                    break
                else:
                    break
        if not match:
            sys.stderr.write('Illegal or unknown character: %s\n' % input[pos])
            pos = pos + 1
        else:
            pos = match.end(0)
    return tokens
Try anchoring your expression:
r'^[A-Za-z][A-Za-z0-9_]*$'
This requires that the entire identifier matches the expression, not just part of it because you are anchoring the expression to the beginning and end of the string. This prevents part of the string from matching.
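When the candidate identifier is already available as a separate string, re.fullmatch gives a similar whole-string check without writing the anchors. This is a small sketch with a hypothetical helper name. One difference from the anchored form: $ also matches just before a trailing newline, while fullmatch does not accept one.

```python
import re

IDENTIFIER = re.compile(r'[A-Za-z][A-Za-z0-9_]*')


def is_identifier(candidate):
    # fullmatch succeeds only if the pattern consumes the entire string.
    return IDENTIFIER.fullmatch(candidate) is not None


for candidate in ('testid_1', 'testid#entifier_', '1abc'):
    print(candidate, is_identifier(candidate))
```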
If the # symbol is your only concern, try this r'[a-zA-Z]#?[a-zA-Z0-9_]+'.
If you want to allow the # as well you could use the following regex:
r'[A-Za-z][A-Za-z0-9_]*#?[A-Za-z0-9_]*'
tested: https://regex101.com/r/vlt8qo/3/
However, going by the description of your problem:
I am defining a regex to match my defined identifiers - an identifier has to start with a letter followed by any number of letters, numbers, and underscores.
there seems to be an inconsistency, since # is not defined as part of your identifiers.
Following your edit in the post:
I have adapted my regex to ->
r'(?<=[\(\)\]\[\-=\+\s\n\t,;\|\.\"])[A-Za-z][A-Za-z0-9_]*(?=[\(\)\]\[\-=\+\s\n\t,;\|\.\"])|^[A-Za-z][A-Za-z0-9_]*(?=[\(\)\]\[\-=\+\s\n\t,;\|\.\"])'
and tested it on several patterns:
https://regex101.com/r/vlt8qo/5/
The POS tagger that I use processes the following string
3+2
as shown below.
3/num++/sign+2/num
I'd like to split this result as follows using python.
['3/num', '+/sign', '2/num']
How can I do that?
Use re.split -
>>> import re
>>> re.split(r'(?<!\+)\+', '3/num++/sign+2/num')
['3/num', '+/sign', '2/num']
The regex pattern will split on a + sign as long as no other + precedes it.
(?<! # negative lookbehind
\+ # plus sign
)
\+ # plus sign
Note that lookbehinds in Python's re module must match strings of a fixed length, so variable-length patterns are not supported inside them.
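An alternative worth considering is to match the word/tag tokens themselves with re.findall instead of splitting on the separator. This is only a sketch under an assumption that the question does not state: a tag never contains a + and a word is either a single + or contains neither + nor /.

```python
import re

tagged = '3/num++/sign+2/num'

# A token is a word (a lone '+' or a run without '+' and '/'),
# then '/', then a tag that runs up to the next '+'.
tokens = re.findall(r'(?:\+|[^+/]+)/[^+]+', tagged)
print(tokens)
```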
The tricky part I believe is the double + sign. You can replace the signs with special characters and get it done.
This should work,
st = '3/num++/sign+2/num'
st = st.replace('++', '#$')
st = st.replace('+', '#')
st = st.replace('$', '+')
print (st.split('#'))
One issue with this is that your original string cannot contain the placeholder characters # and $, so you will need to choose them carefully for your use case.
Edit: This answer is naive; the regex one is better.
That is, as pointed out by COLDSPEED, you should use the following regex approach with a lookbehind:
import re
print(re.split(r'(?<!\+)\+', '3/num++/sign+2/num'))
Although the ask was to use regex, here is an example on how to do this with standard .split():
my_string = '3/num++/sign+2/num'
my_list = []
result = []
# iterate over the split string
for e in my_string.split('/'):
    if '+' in e:
        if '++' in e:
            # split element on double + and add in + as well
            my_list.append(e.split('++')[0])
            my_list.append('+')
        else:
            # split element on single +
            my_list.extend(e.split('+'))
    else:
        # add element
        my_list.append(e)
# at this point my_list contains
# ['3', 'num', '+', 'sign', '2', 'num']
# iterate over the list in steps of 2
for i in range(0, len(my_list), 2):
    # add result
    result.append(my_list[i] + '/' + my_list[i+1])
print('result', result)
# result contains
# ['3/num', '+/sign', '2/num']
I have a file with lines of this form:
ClientsName(0) = "SUPERBRAND": ClientsName(1) = "GREATSTUFF": cClientsNames.Add Key:="SUPER", Item:=ClientsName
and I would like to capture the names in quotes "" after ClientsName(0) = and ClientsName(1) =.
So far, I came up with this code
import re
f = open('corrected_clients_data.txt', 'r')
result = ''
re_name = "ClientsName\(0\) = (.*)"
for line in f:
    name = re.search(line, re_name)
    print (name)
which is returning None at each line...
Two sources of error can be: the backslashes and the capture sequence (.*)...
You can do that more easily using re.findall and using \d instead of 0 to make it more general:
>>> import re
>>> s = '''ClientsName(0) = "SUPERBRAND": ClientsName(1) = "GREATSTUFF": cClientsNames.Add Key:="SUPER", Item:=ClientsName'''
>>> print(re.findall(r'ClientsName\(\d\) = "([^"]*)"', s))
['SUPERBRAND', 'GREATSTUFF']
Another thing you must note is that your order of arguments to search() or findall() is wrong. It should be as follows: re.search(pattern, string)
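Applied to a file read line by line, as in the question, this could look roughly like the sketch below. The names NAME_RE and client_names are hypothetical, and a list stands in for the open file so the example is self-contained. It uses \d+ so that indices with several digits are accepted as well.

```python
import re

# Group 1 is the index inside the parentheses, group 2 the quoted name.
NAME_RE = re.compile(r'ClientsName\((\d+)\) = "([^"]*)"')


def client_names(lines):
    """Return (index, name) pairs found in an iterable of lines."""
    found = []
    for line in lines:
        # With two groups, findall yields one (index, name) tuple per match.
        for index, name in NAME_RE.findall(line):
            found.append((int(index), name))
    return found


sample_lines = [
    'ClientsName(0) = "SUPERBRAND": ClientsName(1) = "GREATSTUFF": '
    'cClientsNames.Add Key:="SUPER", Item:=ClientsName\n',
]
print(client_names(sample_lines))
```

With a real file, an open file object could be passed in place of sample_lines, since iterating over it also yields lines.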
You can use re.findall and just take the first two matches:
>>> s = '''ClientsName(0) = "SUPERBRAND": ClientsName(1) = "GREATSTUFF": cClientsNames.Add Key:="SUPER", Item:=ClientsName'''
>>> re.findall(r'\"([^"]+)\"' , s)[:2]
['SUPERBRAND', 'GREATSTUFF']
Try this:
import re
with open("corrected_clients_data.txt", "r") as text_file:
    text = text_file.read()
matches = re.findall(r'"(.+?)"', text)
The question mark (?) makes the quantifier non-greedy, so each match stops at the first closing double quote it encounters.
Hope this is helpful.
Use a lookbehind with re.findall to get the values of ClientsName(0) and ClientsName(1):
>>> import re
>>> str = '''ClientsName(0) = "SUPERBRAND": ClientsName(1) = "GREATSTUFF": cClientsNames.Add Key:="SUPER", Item:=ClientsName'''
>>> m = re.findall(r'(?<=ClientsName\(0\) = \")[^"]*|(?<=ClientsName\(1\) = \")[^"]*', str)
>>> m
['SUPERBRAND', 'GREATSTUFF']
Explanation:
(?<=ClientsName\(0\) = \") A positive lookbehind places the start of the match just after the string ClientsName(0) = "
[^"]* Then it matches any character other than " zero or more times, so it matches the first value, i.e. SUPERBRAND
| The alternation (logical OR) operator combines the two regexes.
(?<=ClientsName\(1\) = \")[^"]* Matches everything just after the string ClientsName(1) = " up to the next ", so it matches the second value, i.e. GREATSTUFF