I've got this function that checks all the words in the first sequence; if a word ends with one of the words in the second sequence, it removes that ending substring.
I'm trying to achieve all that in one simple lambda function that is supposed to go into pipeline processing, and I can't find a way to do it.
I'll be grateful if you could help me with this:
str_test = "Thiship is a test string testing slowly i'm helpless"
stem_rules = ('less', 'ship', 'ing', 'es', 'ly', 's')
str_test2 = str_test.split()
for i in str_test2:
    for j in stem_rules:
        if i.endswith(j):
            str_test2[str_test2.index(i)] = i[:-len(j)]
            break
Here's a one-liner that invokes a (simple?) lambda to do it.
(lambda words, rules: sum([[word[:-len(rule)]] if word.endswith(rule) else [] for word in words for rule in rules], []))(str_test.split(), stem_rules)
It's not exactly clear how it works, and it's not good practice, but it does the job.
What it generally does is create a one-element list out of each match (or an empty list out of each miss), and then flatten everything into a single list containing only the matches.
Note that it currently emits one output per match, not just the longest match, and it silently drops words that match nothing; but once you figure out how it works, you could, for example, select the shortest remainder from the list of matches for each word in the input.
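For example, here's a variant (my sketch, not part of the original one-liner) that keeps every word and, when several stems match, keeps the shortest remainder, i.e. strips the longest matching stem:
f = lambda words, rules: [
    min([w[:-len(r)] for r in rules if w.endswith(r)] or [w], key=len)
    for w in words
]
print(f(str_test.split(), stem_rules))
# ['Thi', 'i', 'a', 'test', 'str', 'test', 'slow', "i'm", 'help']
The or [w] part keeps unmatched words as-is instead of dropping them.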
May god be with you.
The first thing I'd do is toss your i.endswith(j) for j in stem_rules loop and replace it with a regex that matches and captures the prefix, and matches (but doesn't capture) any suffix:
import re
match_end = re.compile("(.*?)(?:" + "|".join(re.escape(stem) + "$" for stem in stem_rules) + ")")
# This is the same as:
re.compile(r"""
(.*?) # Capturing group matching the prefix
(?: # Begins a non-capturing group...
stem1$|
stem2$|
stem3$ # ...which matches an alternation of the stems, asserting end of string
) # ends the non-capturing group""", re.X)
Then you can use that regex to sub each item in the list.
f = lambda word: match_end.sub(r"\1", word)
Use that wrapped in a list comprehension and you should have your result
words = [f(word) for word in str_test.split()]
# or map(f, str_test.split())
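Putting it all together, a quick self-contained check (note I've added re.escape so a stem containing regex metacharacters can't break the pattern):
import re

str_test = "Thiship is a test string testing slowly i'm helpless"
stem_rules = ('less', 'ship', 'ing', 'es', 'ly', 's')

match_end = re.compile("(.*?)(?:" + "|".join(re.escape(stem) + "$" for stem in stem_rules) + ")")
f = lambda word: match_end.sub(r"\1", word)
print([f(word) for word in str_test.split()])
# ['Thi', 'i', 'a', 'test', 'str', 'test', 'slow', "i'm", 'help']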
To convert your current code into a single lambda, each step in the pipeline needs to behave in a very functional manner: receive some data, then emit some data. You need to avoid anything that deviates from that paradigm -- in particular, the use of things like break. Here's one way to rewrite the steps in that manner:
text = ("Thiship is a test string testing slowly i'm helpless")
stems = ('less', 'ship', 'ing', 'es', 'ly','s')
# The steps:
# - get words from the text
# - pair each word with its matching stems
# - create a list of cleaned words (stems removed)
# - make the new text
words = text.split()
wstems = [ (w, [s for s in stems if w.endswith(s)]) for w in words ]
cwords = [ w[0:-len(ss[0])] if ss else w for w, ss in wstems ]
text2 = ' '.join(cwords)
print(text2)
With those parts in hand, a single lambda can be created using ordinary substitution. Here's the monstrosity:
f = lambda txt: [
    w[0:-len(ss[0])] if ss else w
    for w, ss in [ (w, [s for s in stems if w.endswith(s)]) for w in txt.split() ]
]
text3 = ' '.join(f(text))
print(text3)
I wasn't sure whether you want the lambda to return the new words or the new text -- adjust as needed.
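For instance, if you want the new text back in one call, a trivial wrapper (my addition) would be:
g = lambda txt: ' '.join(f(txt))
print(g(text))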
Task
Write a program that will decode the secret message by reversing text
between square brackets. The message may contain nested brackets (that
is, brackets within brackets, such as One[owT[Three[ruoF]]]). In
this case, innermost brackets take precedence, similar to parentheses
in mathematical expressions, e.g. you could decode the aforementioned
example like this:
One[owT[Three[ruoF]]]
One[owT[ThreeFour]]
One[owTruoFeerhT]
OneThreeFourTwo
In order to make your own task slightly easier and less tricky, you
have already replaced all whitespaces in the original text with
underscores (“_”) while copying it from the paper version.
Input description
The first and only line of the standard input
consists of a non-empty string of up to 2 · 10^6 characters which may
be letters, digits, basic punctuation (“,.?!’-;:”), underscores (“_”)
and square brackets (“[]”). You can safely assume that all square
brackets are paired correctly, i.e. every opening bracket has exactly
one closing bracket matching it and vice versa.
Output description
The standard output should contain one line – the
decoded secret message without any square brackets.
Example
For sample input:
A[W_[y,[]]oh]o[dlr][!]
the correct output is:
Ahoy,_World!
Explanation
This example contains empty brackets. Of course, an empty string, when
reversed, remains empty, so we can simply ignore them. Then, as
previously, we can decode this example in stages, first reversing the
innermost brackets to obtain A[W_,yoh]o[dlr][!]. Afterwards, there
are no longer any nested brackets, so the remainder of the task is
trivial.
Below is my program that doesn't quite work
word = input("print something: ")
word_reverse = word[::-1]
while "[" in word and "]" in word:
    open_brackets_index = word.index("[")
    close_brackets_index = word_reverse.index("]")*(-1)-1
    # print(word)
    # print(open_brackets_index)
    # print(close_brackets_index)
    reverse_word_into_quotes = word[open_brackets_index+1:close_brackets_index:][::-1]
    word = word[:close_brackets_index]
    word = word[:open_brackets_index]
    word = word + reverse_word_into_quotes
    word = word.replace("[","]").replace("]","[")
    print(word)
print(word)
Unfortunately my code only works with one pair of brackets and I don't know how to fix it.
Thank you in advance for your help
Assuming the re module can be used, this code does the job:
import re

text = 'A[W_[y,[]]oh]o[dlr][!]'
# This scary regular expression does all the work:
# it finds a sequence that starts with [ and ends with ] and
# contains anything BUT [ and ]
pattern = re.compile(r'\[([^\[\]]*)\]')
while True:
    m = re.search(pattern, text)
    if m:
        # Here a single innermost pattern like [String] is replaced with gnirtS
        text = re.sub(pattern, m[1][::-1], text, count=1)
    else:
        break
print(text)
Which prints this line:
Ahoy,_World!
I realize my previous answer has been accepted but, for completeness, I'm submitting a second solution that does NOT use the re module:
text = 'A[W_[y,[]]oh]o[dlr][!]'

def find_pattern(text):
    # Find [...] and return the locations of [ (start) and ] (end),
    # plus the in-between str (content)
    content = ''
    for i, c in enumerate(text):
        if c == '[':
            content = ''
            start = i
        elif c == ']':
            end = i
            return start, end, content
        else:
            content += c
    return None, None, None

while True:
    start, end, content = find_pattern(text)
    if start is None:
        break
    # Replace the content between [] with its reverse
    text = "".join((text[:start], content[::-1], text[end+1:]))
print(text)
I'm looking for a more elegant solution to replace some words, not known upfront, in a string, except not, and, and or.
(The input below is only an example; it could be anything, but it will always be evaluable with eval().)
input: (DEFINE_A or not(DEFINE_B and not (DEFINE_C))) and DEFINE_A
output: (self.DEFINE_A or not(self.DEFINE_B and not (self.DEFINE_C))) and self.DEFINE_A
I created a solution, but it looks kind of strange. Is there a cleaner way?
import re

s = '(DEFINE_A or not(DEFINE_B and not (DEFINE_C))) and DEFINE_A'
words = re.findall(r'[\w]+|[()]*|[ ]*', s)
for index, word in enumerate(words):
    w = re.findall('^[a-zA-Z_]+$', word)
    if w and w[0] not in ['and', 'or', 'not']:
        z = 'self.' + w[0]
        words[index] = z
new = ''.join(str(x) for x in words)
print(new)
Will print correctly:
(self.DEFINE_A or not(self.DEFINE_B and not (self.DEFINE_C))) and self.DEFINE_A
First of all, you can match only words by using a simple \w+. Then, using a negative lookahead, you can exclude the ones you don't want. Now all that's left to do is use re.sub directly with that pattern:
import re

s = '(DEFINE_A or not(DEFINE_B and not (DEFINE_C))) and DEFINE_A'
new = re.sub(r"(?!and|not|or)\b(\w+)", r"self.\1", s)
print(new)
Which will give:
(self.DEFINE_A or not(self.DEFINE_B and not (self.DEFINE_C))) and self.DEFINE_A
If the names of your "variables" will always be capitalized, this simplifies the pattern a bit and makes it much more efficient. Simply use:
new = re.sub(r"([A-Z\d_]+)", r"self.\1", s)
This is not only a simpler pattern (for readability), but also much more efficient: on this example, it takes only 70 steps compared to 196 for the original.
I have a string and rules/mappings for replacement and no-replacements.
E.g.
"This is an example sentence that needs to be processed into a new sentence."
"This is a second example sentence that shows how 'sentence' in 'sentencepiece' should not be replaced."
Replacement rules:
replace_dictionary = {'sentence': 'processed_sentence'}
no_replace_set = {'example sentence'}
Result:
"This is an example sentence that needs to be processed into a new processed_sentence."
"This is a second example sentence that shows how 'processed_sentence' in 'sentencepiece' should not be replaced."
Additional criteria:
Only replace if case is matched, i.e. case matters.
Whole-word replacement only; punctuation should be ignored for matching but kept after replacement.
What would be the cleanest way to solve this problem in Python 3.x?
Based on the answer of demongolem.
UPDATE
I am sorry, I missed the fact that only whole words should be replaced. I updated my code and even generalized it for usage in a function.
import re

def replace_whole(sentence, replace_token, replace_with, dont_replace):
    rx = rf"[\"'.,:; ]({replace_token})[\"'.,:; ]"
    matches = re.finditer(rx, sentence)
    out_sentence = ""
    found = []
    indices = []
    for m in matches:
        indices.append(m.start(0))
        found.append(m.group())
    context_size = len(dont_replace)
    for i in range(len(indices)):
        context = sentence[indices[i]-context_size:indices[i]+context_size]
        if dont_replace in context:
            continue
        else:
            # First replace the word only in the substring found
            to_replace = found[i].replace(replace_token, replace_with)
            # Then replace the word in the context found, so any special token
            # like quotes or . gets carried over and the context does not change
            replace_val = context.replace(found[i], to_replace)
            # Finally replace the found context with the replacing context
            out_sentence = sentence.replace(context, replace_val)
    return out_sentence
Use regular expressions for finding all occurrences and values of your string (we need to check whether each occurrence is a whole word or embedded in some other word), using finditer(). You might need to adjust the rx to match your definition of "whole word". Then get the context around these values, of the size of your no_replace rule, and check whether the context contains your no_replace string.
If it doesn't, you may replace it: use replace() on the found word only, then replace the occurrence of the word within its context (so any surrounding token like quotes or periods is carried over and the context does not change), then replace the context in the whole text. That way the replacement is nearly unique and no weird behaviour should happen.
Using your examples, this leads to:
replace_whole(sen2, "sentence", "processed_sentence", "example sentence")
>>>"This is a second example sentence that shows how 'processed_sentence' in 'sentencepiece' should not be replaced."
and
replace_whole(sen1, "sentence", "processed_sentence", "example sentence")
>>>'This is an example sentence that needs to be processed into a new processed_sentence.'
After some research, this is what I believe to be the best and cleanest solution to my problem. The solution works by calling match_fun whenever a match has been found, and match_fun performs the replacement if, and only if, there is no "no-replace phrase" overlapping the current match. Let me know if you need more clarification or if you believe something can be improved.
import re

replace_dict = ...     # The code below assumes you already have this
no_replace_dict = ...  # The code below assumes you already have this
text = ...             # The text on input.

def match_fun(match: re.Match):
    str_match: str = match.group()
    if str_match not in no_replace_dict:
        return replace_dict[str_match]
    for no_replace in no_replace_dict[str_match]:
        no_replace_matches_iter = re.finditer(r'\b' + no_replace + r'\b', text)
        for no_replace_match in no_replace_matches_iter:
            # Keep the original token when it overlaps a no-replace phrase
            if no_replace_match.start() >= match.start() and no_replace_match.start() < match.end():
                return str_match
            if no_replace_match.end() > match.start() and no_replace_match.end() <= match.end():
                return str_match
    return replace_dict[str_match]

for replace in replace_dict:
    pattern = re.compile(r'\b' + replace + r'\b')
    text = pattern.sub(match_fun, text)
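For instance, with the data from the question (the shape of no_replace_dict -- a mapping from each replaceable word to the phrases that protect it -- is my assumption; the code above only requires membership tests and lookups on it):
replace_dict = {'sentence': 'processed_sentence'}
no_replace_dict = {'sentence': ['example sentence']}  # assumed shape: word -> protecting phrases
text = "This is an example sentence that needs to be processed into a new sentence."
# Running the loop above then yields:
# This is an example sentence that needs to be processed into a new processed_sentence.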
tokens = ['analytics', 'mining', 'quantities', ...]
for i in tokens:
    stem = re.sub(r'(\w+)(tics$)', r'\1sis', i, flags=re.IGNORECASE)
In this example, I'm replacing 'analytics' with 'analysis' using the re.sub().
What I want to do is to do this replacement using multiple patterns, for example:
stem = re.sub(r'(\w+)(ing$)', r'\1e', i, flags=re.IGNORECASE)
So that 'mining' would be replaced by 'mine'. And so on.
I was thinking of using a dict with patterns and repls. I imagine the dict would look something like this:
rules = {
    r'(\w+)(tics$)': r'\1sis',
    r'(\w+)(ing$)': r'\1e',
    ...
}
Would the backreference even work in a dict? I also don't know how to use a dict with re.sub. How should I proceed?
Edit for further clarification:
The whole tokens list has a lot of items and I want to do the replacement on words that match any pattern. For example, there might be the word 'dining' further down in tokens, and I want the second rule to catch that and replace it with 'dine'.
Try using a custom function in re.sub
Ex:
import re

tokens = ['analytics', 'mining', 'quantities', 'analyticsddd']
replacement = {"tics": "sis", "ing": "e"}
ptrn = re.compile("(" + "|".join(replacement.keys()) + ")$")
for i in tokens:
    print(ptrn.sub(lambda x: replacement.get(x.group(), x.group()), i))
Output:
analysis
mine
quantities
analyticsddd
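To answer the original question directly: backreferences work fine when the patterns and replacements are stored in a dict, because both are ordinary strings. A minimal sketch (rules and tokens borrowed from the question; note the rules apply in order, so a replacement could in principle feed a later rule):
import re

rules = {
    r'(\w+)(tics$)': r'\1sis',
    r'(\w+)(ing$)': r'\1e',
}

tokens = ['analytics', 'mining', 'quantities']
stems = []
for word in tokens:
    for pattern, repl in rules.items():
        word = re.sub(pattern, repl, word, flags=re.IGNORECASE)
    stems.append(word)
print(stems)  # ['analysis', 'mine', 'quantities']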
This code is meant to read a text file and add every word to a dictionary where the key is the first letter and the values are all the words in the file that start with that letter. It kinda works, but I run into two problems:
the dictionary keys contain apostrophes and periods (how do I exclude those?)
the values aren't sorted alphabetically and are all jumbled up. The code ends up outputting something like this:
' - {"don't", "i'm", "let's"}
. - {'below.', 'farm.', 'them.'}
a - {'take', 'masters', 'can', 'fallow'}
b - {'barnacle', 'labyrinth', 'pebble'}
...
...
y - {'they', 'very', 'yellow', 'pastry'}
when it should be more like:
a - {'ape', 'army', 'arrow', 'arson'}
b - {'bank', 'blast', 'blaze', 'breathe'}
etc
# make empty dictionary
dic = {}
# read file
infile = open('file.txt', "r")
# read first line
lines = infile.readline()
while lines != "":
    # split the words up and remove "\n" from the end of the line
    lines = lines.rstrip()
    lines = lines.split()
    for word in lines:
        for char in word:
            # add if not in dictionary
            if char not in dic:
                dic[char.lower()] = set([word.lower()])
            # Else, add word to set
            else:
                dic[char.lower()].add(word.lower())
    # Continue reading
    lines = infile.readline()
# Close file
infile.close()
# Print
for letter in sorted(dic):
    print(letter + " - " + str(dic[letter]))
I'm guessing I need to remove the punctuation and apostrophes from the whole file when I'm first iterating through it but before adding anything to the dictionary? Totally lost on getting the values in the right order though.
Use defaultdict(set) and dic[word[0]].add(word), after removing any starting punctuation. No need for the inner loop.
from collections import defaultdict

def process_file(fn):
    my_dict = defaultdict(set)
    for word in open(fn, 'r').read().split():
        if word[0].isalpha():
            my_dict[word[0].lower()].add(word)
    return my_dict

word_dict = process_file('file.txt')
for letter in sorted(word_dict):
    print(letter + " - " + ', '.join(sorted(word_dict[letter])))
You have a number of problems
splitting words on spaces AND punctuation
adding words to a set that could not exist at the time of the first addition
sorting the output
Here's a short program that tries to solve the above issues:
import re, string

# instead of using "text = open(filename).read()" we exploit a piece
# of text contained in one of the imported modules
text = re.__doc__

# 1. how to split at once the text contained in the file
#    credit to https://stackoverflow.com/a/13184791/2749397
p_ws = string.punctuation + string.whitespace
words = re.split('|'.join(re.escape(c) for c in p_ws), text)

# 2. how to instantiate a set when we do the first addition to a key,
#    that is, using the .setdefault method of every dictionary
d = {}
# Note: words are regularized by lowercasing, and we skip the empty tokens
for word in (w.lower() for w in words if w):
    d.setdefault(word[0], set()).add(word)

# 3. how to print the sorted entries corresponding to each letter
for letter in sorted(d.keys()):
    print(letter, *sorted(d[letter]))
My text contains numbers, so numbers show up in the output of the above program (see below); if you don't want them, filter them out with if letter not in '0123456789': print(...).
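Spelled out, that filter would be (the sample output below was produced without it):
for letter in sorted(d.keys()):
    if letter not in '0123456789':
        print(letter, *sorted(d[letter]))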
And here is the output...
0 0
1 1
8 8
9 9
a a above accessible after ailmsux all alphanumeric alphanumerics also an and any are as ascii at available
b b backslash be before beginning behaviour being below bit both but by bytes
c cache can case categories character characters clear comment comments compatibility compile complement complementing concatenate consist consume contain contents corresponding creates current
d d decimal default defined defines dependent digit digits doesn dotall
e each earlier either empty end equivalent error escape escapes except exception exports expression expressions
f f find findall finditer first fixed flag flags following for forbidden found from fullmatch functions
g greedy group grouping
i i id if ignore ignorecase ignored in including indicates insensitive inside into is it iterator
j just
l l last later length letters like lines list literal locale looking
m m made make many match matched matches matching means module more most multiline must
n n name named needn newline next nicer no non not null number
o object occurrences of on only operations optional or ordinary otherwise outside
p p parameters parentheses pattern patterns perform perl plus possible preceded preceding presence previous processed provides purge
r r range rather re regular repetitions resulting retrieved return
s s same search second see sequence sequences set signals similar simplest simply so some special specified split start string strings sub subn substitute substitutions substring support supports
t t takes text than that the themselves then they this those three to
u u underscore unicode us
v v verbose version versions
w w well which whitespace whole will with without word
x x
y yes yielding you
z z z0 za
Without comments, and with a little obfuscation, it's just 3 lines of code...
import re, string
text = re.__doc__
p_ws = string.punctuation + string.whitespace
words = re.split('|'.join(re.escape(c) for c in p_ws), text)
d, add2d = {}, lambda w: d.setdefault(w[0],set()).add(w) #1
for word in (w.lower() for w in words if w): add2d(word) #2
for abc in sorted(d.keys()): print(abc, *sorted(d[abc])) #3