replace certain unknown words in a string - python

I'm looking for a more elegant solution to replace some words in a string that aren't known up front, except for not, and, and or:
(The input below is only an example; it could be anything, but it will always be evaluable with eval().)
input: (DEFINE_A or not(DEFINE_B and not (DEFINE_C))) and DEFINE_A
output: (self.DEFINE_A or not(self.DEFINE_B and not (self.DEFINE_C))) and self.DEFINE_A
I created a solution, but it looks kind of strange. Is there a cleaner way?
import re

s = '(DEFINE_A or not(DEFINE_B and not (DEFINE_C))) and DEFINE_A'
words = re.findall(r'[\w]+|[()]*|[ ]*', s)
for index, word in enumerate(words):
    w = re.findall('^[a-zA-Z_]+$', word)
    if w and w[0] not in ['and', 'or', 'not']:
        z = 'self.' + w[0]
        words[index] = z
new = ''.join(str(x) for x in words)
print(new)
Will print correctly:
(self.DEFINE_A or not(self.DEFINE_B and not (self.DEFINE_C))) and self.DEFINE_A

First of all, you can match only words by using a simple \w+. Then, using a negative lookahead, you can exclude the ones you don't want. Now all that's left to do is use re.sub directly with that pattern:
import re

s = '(DEFINE_A or not(DEFINE_B and not (DEFINE_C))) and DEFINE_A'
new = re.sub(r"(?!and|not|or)\b(\w+)", r"self.\1", s)
print(new)
Which will give:
(self.DEFINE_A or not(self.DEFINE_B and not (self.DEFINE_C))) and self.DEFINE_A
You can test out and see how this regex works here.
If the names of your "variables" will always be capitalized, this simplifies the pattern a bit and makes it much more efficient. Simply use:
new = re.sub(r"([A-Z\d_]+)", r"self.\1", s)
This is not only a simpler pattern (for readability), but it is also much more efficient. On this example it takes only 70 steps, compared to 196 for the original (this can be seen in the top-right corner in the links).
You can see the new pattern in action here.
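Since the question mentions that the result will always be fed to eval(), here is a minimal, hypothetical sketch of how the rewritten expression could be evaluated against an object's attributes (the Config class and its flag values are assumptions for illustration, not part of the original question):
import re

class Config:
    DEFINE_A = True
    DEFINE_B = False
    DEFINE_C = True

    def evaluate(self, expr: str) -> bool:
        # Prefix every identifier except the boolean keywords with "self."
        rewritten = re.sub(r"(?!and|not|or)\b(\w+)", r"self.\1", expr)
        return eval(rewritten)  # the question states the input is always eval()-able

print(Config().evaluate('(DEFINE_A or not(DEFINE_B and not (DEFINE_C))) and DEFINE_A'))  # True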

Related

remove gibberish prefix from a string

a = "aajfkdfvf_valid_name0"
b = "gdhdhsdsdeeeeex_valid_name1"
How do I remove the gibberish before valid from my strings so that I have something like this:
valid_name0
valid_name1
If your strings always contain the word valid, then you can try something like this:
a = "aajfkdfvf_valid_name0"
b = "gdhdhsdsdeeeeex_valid_name1"
for s in (a, b):
    print(s[s.rfind('valid'):])
So even if the prefix contains _ or the substring valid, the output will be correct. However, if the part you want to keep itself contains the word valid multiple times, this will not work.
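For example, with a hypothetical string whose wanted part itself repeats valid, rfind latches onto the last occurrence and cuts off too much:
c = "xxjunkxx_valid_valid_name2"  # hypothetical example, not from the question
print(c[c.rfind('valid'):])       # valid_name2 -- the first "valid_" is lost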
We can try using re.sub here:
a = "aajfkdfvf_valid_name0"
b = "gdhdhsdsdeeeeex_valid_name1"
inp = [a, b]
output = [re.sub(r'^[^_]+_', '', i) for i in inp]
print(output) # ['valid_name0', 'valid_name1']
You can use a split join approach for this.
Try this:
a = "aajfkdfvf_valid_name0"
valid_a = '_'.join(a.split('_')[1:])
# 'valid_name0'
# can use maxsplit to split only once at the first _ and then take the remaining part of the string
another_valid_a = a.split('_',1)[1]
# valid_name0
Basically, this splits the original string at each _, drops the first element, and joins the remaining parts back together with _.
The other approaches seem a bit too over-engineered for this task, at least in my opinion.
If you already know that the gibberish comes before the first underscore _ character, you can just do a single str.split and discard the first split result:
a = "aajfkdfvf_valid_name0"
b = "gdhdhsdsdeeeeex_valid_name1"
def clean_string(s: str) -> str:
return s.split('_', 1)[1]
print(clean_string(a)) # valid_name0
print(clean_string(b)) # valid_name1
If you're sure that splitting on '_' is all you need, a string split will help:
fixed_a = '_'.join(a.split('_')[1:])
The worst case is that this pattern is not the only one you're looking at. In that case you need to know exactly what your 'valid_name' looks like, and you can write a regex for it.
Look for standards and patterns in your data; if there is a pattern, a regex can handle it.
I recommend using a regex-testing site to work it out.
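As a sketch of that regex idea, assuming the wanted part always starts with the literal valid_, you could strip everything before it with a lookahead (the exact pattern is an assumption for illustration):
import re

a = "aajfkdfvf_valid_name0"
b = "gdhdhsdsdeeeeex_valid_name1"
for s in (a, b):
    print(re.sub(r'^.*?(?=valid_)', '', s))  # valid_name0, valid_name1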

How to write regex to fix words composed of duplicate letters?

I scraped a few PDFs, and some thick fonts get scraped as in this example:
text='and assesses oouurr rreeffoorrmmeedd tteeaacchhiinngg in the classroom'
instead of
"and assesses our reformed teaching in the classroom"
How do I fix this? I am trying with a regex:
pattern=r'([a-z])(?=\1)'
re.sub(pattern,'',text)
#"and aseses reformed teaching in the clasrom"
I am thinking of grouping the two groups above and adding word boundaries.
EDIT: this one fixes words with an even number of letters:
pattern = r'([a-z])\1([a-z])\2'
re.sub(pattern, r'\1\2', text)
#"and assesses oouurr reformed teaching in the classroom"
If all the letters in a word are duplicated, you can try something like this:
for w in text.split():
    if len(w) % 2 != 0:
        print(w)
        continue
    if w[0::2] == w[1::2]:
        print(w[0::2])
        continue
    print(w)
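The same parity check can be wrapped into a small helper that rebuilds the sentence instead of printing word by word (a minimal sketch based on the answer above; the name dedouble is just for illustration):
def dedouble(word: str) -> str:
    # If every letter appears twice in a row, keep every second character.
    if len(word) % 2 == 0 and word[0::2] == word[1::2]:
        return word[0::2]
    return word

text = 'and assesses oouurr rreeffoorrmmeedd tteeaacchhiinngg in the classroom'
print(' '.join(dedouble(w) for w in text.split()))
# and assesses our reformed teaching in the classroom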
I am using a mixed approach: build the pattern and the substitution in a for loop, then apply the regex. The patterns applied go from words of up to nine doubled letters (e.g. 8x2 = 16 letters) down to two doubled letters.
import re

text = 'and assesses oouurr rreeffoorrmmeedd tteeaacchhiinngg in the classroom'
wrd_len = [9, 8, 7, 6, 5, 4, 3, 2]
for l in wrd_len:
    sub = '\\' + '\\'.join(map(str, range(1, l + 1)))
    pattern = '([a-z])\\' + '([a-z])\\'.join(map(str, range(1, l + 1)))
    text = re.sub(pattern, sub, text)
print(text)
# and assesses our reformed teaching in the classroom
For example, the regex for words of three doubled letters becomes:
re.sub(r'([a-z])\1([a-z])\2([a-z])\3', r'\1\2\3', text)
As a side note, I could not get those backslashes right with raw strings when building the pattern in the loop, and I am actually going to use [a-zA-Z].
I found a solution in JavaScript that works fine:
([a-z])\1(?:(?=([a-z])\2)|(?<=\3([a-z])\1\1))
but somehow it doesn't work in Python, because a lookbehind can't contain references to a group, so I came up with another solution that works for this example:
([a-z])\1(?:(?=([a-z])\2)|(?=[^a-z]))
Try it here.
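For completeness, a quick way to try that adapted pattern with re.sub; this sketch works on the example text because every doubled word there is followed by a space:
import re

text = 'and assesses oouurr rreeffoorrmmeedd tteeaacchhiinngg in the classroom'
print(re.sub(r'([a-z])\1(?:(?=([a-z])\2)|(?=[^a-z]))', r'\1', text))
# and assesses our reformed teaching in the classroom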

Trying to convert a nested loop with two sequences into a lambda

I've got this function that checks all the words in the first sequence; if a word ends with one of the substrings in the second sequence, it removes that end substring.
I'm trying to achieve all that in one simple lambda function that is supposed to go into pipeline processing, and I can't find a way to do it.
I'll be grateful if you could help me with this:
str_test = ("Thiship is a test string testing slowly i'm helpless")
stem_rules = ('less', 'ship', 'ing', 'es', 'ly', 's')
str_test2 = str_test.split()
for i in str_test2:
    for j in stem_rules:
        if i.endswith(j):
            str_test2[str_test2.index(i)] = i[:-len(j)]
            break
This is a one-liner that activates a (simple?) lambda that does it.
(lambda words, rules: sum([[word[:-len(rule)]] if word.endswith(rule) else [] for word in words for rule in rules], []))(str_test.split(), stem_rules)
It's not obvious how it works, and it's not good practice to do it.
What it essentially does is create a single-element list for each match and an empty list for each miss, then flatten everything into one list containing only the matches.
Currently it emits a result for every matching rule, not just the longest match, but once you figure out how it works you could select the shortest result from the list of matches for each word in the input.
May god be with you.
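Picking up that suggestion, here is a hedged sketch of a lambda that keeps every word and, when several rules match, keeps the shortest stripped result (i.e. removes the longest matching suffix); it uses str_test and stem_rules from the question and needs Python 3.4+ for min(..., default=...):
strip_words = lambda words, rules: [
    min([w[:-len(r)] for r in rules if w.endswith(r)], key=len, default=w)
    for w in words
]
print(' '.join(strip_words(str_test.split(), stem_rules)))
# Thi i a test str test slow i'm help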
The first thing I'd do is toss your i.endswith(j) for j in stem_rules loop and replace it with a regex that matches and captures the prefix string, and matches (but doesn't capture) any suffix:
import re
match_end = re.compile("(.*?)(?:" + "|".join(stem + "$" for stem in stem_rules) + ")")
# This is the same as:
re.compile(r"""
    (.*?)       # Capturing group matching the prefix
    (?:         # Begins a non-capturing group...
        stem1$|
        stem2$|
        stem3$  # ...which matches an alternation of the stems, asserting end of string
    )           # ends the non-capturing group""", re.X)
Then you can use that regex to sub each item in the list.
f = lambda word: match_end.sub(r"\1", word)
Use that wrapped in a list comprehension and you should have your result
words = [f(word) for word in str_test.split()]
# or map(f, str_test.split())
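A quick sanity check of the compiled pattern on a couple of individual words (assuming the stem_rules tuple from the question):
print(match_end.sub(r"\1", "helpless"))  # help
print(match_end.sub(r"\1", "testing"))   # test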
To convert your current code into a single lambda, each step in the pipeline needs to behave in a very functional manner: receive some data, and then emit some data. You need to avoid anything that deviates from that paradigm -- in particular, the use of things like break. Here's one way to rewrite the steps in that manner:
text = ("Thiship is a test string testing slowly i'm helpless")
stems = ('less', 'ship', 'ing', 'es', 'ly','s')
# The steps:
# - get words from the text
# - pair each word with its matching stems
# - create a list of cleaned words (stems removed)
# - make the new text
words = text.split()
wstems = [ (w, [s for s in stems if w.endswith(s)]) for w in words ]
cwords = [ w[0:-len(ss[0])] if ss else w for w, ss in wstems ]
text2 = ' '.join(cwords)
print(text2)
With those parts in hand, a single lambda can be created using ordinary substitution. Here's the monstrosity:
f = lambda txt: [
    w[0:-len(ss[0])] if ss else w
    for w, ss in [(w, [s for s in stems if w.endswith(s)]) for w in txt.split()]
]
text3 = ' '.join(f(text))
print(text3)
I wasn't sure whether you want the lambda to return the new words or the new text -- adjust as needed.

Smart pythonic way of removing if elif on regular expressions

I have a series of regular expressions that are tried in order: I need to check the first one, then the second, then the third, and so on right through to the end. I need to do some processing on the matched string, so I'm trying to avoid too much logic, but in Python, unlike Perl, I don't think I can perform an assignment inside the if-elif-elif... blocks. So I end up doing an assignment, then checking for a match, and then getting the results of that match. For example:
m = re.search(patternA, string)
if m:
    stripped = m.group(0)
    xyz = stripped[45:67]
else:
    m = re.search(patternB, string)
    if m:
        stripped = m.group(0)
        abc = stripped[5:7]
    else:
        m = re.search(patternC, string)
        if m:
            stripped = m.group(0)
            txt = stripped[4:5]
        else:
            ......
Ideally I'd like to find a better structure that preserves the ordering of the tested regular expressions and also lets me incorporate the assignment into the if-then statements. So for example:
if (m = re.search(patternA, string)):
    stripped = m.group(0)
    xyz = stripped[45:67]
elif (m = re.search(patternB, string)):
    stripped = m.group(0)
    abc = stripped[5:7]
...
What is the most pythonic way of dealing with this? Thanks.
The use case is to read old data - very old data. However, each string may include information about particular values, and these are present only if the regular expression matches a particular pattern. So the variables extracted are highly dependent upon what matches.
for (pattern, slice) in zip([patternA, patternB, patternC],
                            [slice(45, 67), slice(5, 7), slice(4, 5)]):
    m = re.search(pattern, string)
    if m:
        value = m.group(0)[slice]
        break
else:
    pass  # Handle "no match found for any pattern" here
This iterates over pairs of regular expressions and the relevant portion of their match until a match is found. If there is no match found, the else clause of the for loop will execute. The result of the match is found in value after the loop, regardless of which pattern matches.
Having different variables set based on which "branch" succeeds is not a great idea, since you won't necessarily know which variables are set at any given time. A dictionary would be a better idea if you really want separate labels for each match, since you can query which key or keys are set in a dictionary.
value = {}
for (pattern, slice, key) in zip([patternA, patternB, patternC],
                                 [slice(45, 67), slice(5, 7), slice(4, 5)],
                                 ['xyz', 'abc', 'txt']):
    m = re.search(pattern, string)
    if m:
        value[key] = m.group(0)[slice]
        break
The general idea, though, is to note that your chain of if statements is like a hard-coded iteration, so you just need to identify which parts of each if/elif clause vary from the preceding ones and create a list that you can iterate over instead.
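As a side note, on Python 3.8 and later the assignment expression (the := "walrus" operator) allows essentially the syntax wished for in the question, so the if/elif chain itself becomes legal; a minimal sketch, with patternA, patternB and string as placeholders just like above:
if (m := re.search(patternA, string)):
    xyz = m.group(0)[45:67]
elif (m := re.search(patternB, string)):
    abc = m.group(0)[5:7]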

Replacing recurring characters in strings in Python 3.1

Is it possible to replace a single character inside a string that occurs many times?
Input:
Sentence = ("This is an Example. Thxs code is not what I'm having problems with.")  # Example input
Sentence = ("This is an Example. This code is not what I'm having problems with.")  # Desired output
Replace the 'x' in "Thxs" with an i, without replacing the x in "Example".
You can do it by including some context:
s = s.replace("Thxs", "This")
Alternatively you can keep a list of words that you don't wish to replace:
import re

whitelist = ['example', 'explanation']

def replace_except_whitelist(m):
    s = m.group()
    if s in whitelist:
        return s
    else:
        return s.replace('x', 'i')

s = 'Thxs example'
result = re.sub(r"\w+", replace_except_whitelist, s)
print(result)
Output:
This example
Sure, but you essentially have to build up a new string out of the parts you want:
>>> s = "This is an Example. Thxs code is not what I'm having problems with."
>>> s[22]
'x'
>>> s[:22] + "i" + s[23:]
"This is an Example. This code is not what I'm having problems with."
For information about the notation used here, see a good primer on Python slice notation.
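The slicing idea can also be wrapped into a tiny helper, assuming you already know the index of the character to change (the name replace_at is just for illustration):
def replace_at(s: str, index: int, new_char: str) -> str:
    # Rebuild the string around the given position.
    return s[:index] + new_char + s[index + 1:]

s = "This is an Example. Thxs code is not what I'm having problems with."
print(replace_at(s, 22, 'i'))  # replaces the 'x' in "Thxs"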
If you know whether you want to replace the first occurrence of x, or the second, or the third, or the last, you can combine str.find (or str.rfind if you wish to start from the end of the string) with slicing and str.replace. Feed the character you wish to replace to the first method as many times as needed to get a position just before the character you want to replace (for the specific sentence you suggest, just once), then slice the string in two and replace only one occurrence in the second slice.
An example is worth a thousand words, or so they say. In the following, I assume you want to substitute the (n+1)-th occurrence of the character.
>>> s = "This is an Example. Thxs code is not what I'm having problems with."
>>> n = 1
>>> pos = 0
>>> for i in range(n):
...     pos = s.find('x', pos) + 1
...
>>> s[:pos] + s[pos:].replace('x', 'i', 1)
"This is an Example. This code is not what I'm having problems with."
Note that you need to add an offset to pos, otherwise you will replace the occurrence of x you have just found.
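The same find-then-slice idea can be packaged as a reusable helper; a minimal sketch (the name replace_nth and its 1-based n parameter are my own choices for illustration):
def replace_nth(s: str, old: str, new: str, n: int) -> str:
    # Replace only the n-th (1-based) occurrence of `old` in `s`.
    pos = -1
    for _ in range(n):
        pos = s.find(old, pos + 1)
        if pos == -1:  # fewer than n occurrences: return the string unchanged
            return s
    return s[:pos] + new + s[pos + len(old):]

s = "This is an Example. Thxs code is not what I'm having problems with."
print(replace_nth(s, 'x', 'i', 2))
# This is an Example. This code is not what I'm having problems with.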
