How can I use the maketrans() and translate() methods on a string to get the translated string in Python3?
eg.
In variable
s='The quick brown fox'
If I have to replace 'The' with '123' in variable 's' how can I do it with these two methods in Python3?
output: '123 quick brown fox'
str.translate is not the right function for this particular use case. It replaces individual characters with other characters; it does not work on groups of characters. For example:
>>> string = 'The quick brown fox and a hound'
>>> tab = str.maketrans('The', '123')
>>> string.translate(tab)
'123 quick brown fox and a 2ound'
In addition to translating 'The', it also translated the 'h' in 'hound'.
For your particular use case, str.replace would be a good alternative:
>>> string.replace('The', '123')
'123 quick brown fox and a hound'
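For contrast, here is a minimal sketch of the kind of job str.translate is suited to: mapping single characters to other characters and deleting some outright. The function name leetify and the sample mapping are made up for illustration; the three-argument form of str.maketrans (from, to, delete) is standard Python 3.

```python
def leetify(text):
    # Map a->4, e->3, o->0 one character at a time; the third
    # argument lists characters to delete ('!' and '?').
    table = str.maketrans("aeo", "430", "!?")
    return text.translate(table)


print(leetify("hello, are you ok?!"))  # h3ll0, 4r3 y0u 0k
```

Every 'a', 'e' and 'o' is changed wherever it occurs, which is exactly why this approach is wrong for replacing the word 'The'.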
Given a phrase in a given line, I need to be able to match that phrase even if the words have a different number of spaces in the line.
Thus, if the phrase is "the quick brown fox" and the line is "the quick brown fox jumped over the lazy dog", the instance of "the quick brown fox" should still be matched.
The method I already tried was to replace all instances of whitespace in the line with a regex pattern for whitespace, but this doesn't always work if the line contains characters that aren't treated as literal by regex.
This should work:
import re
pattern = r'the\s+quick\s+brown\s+fox'
text = 'the quick brown fox jumped over the lazy dog'
match = re.match(pattern, text)
print(match.group(0))
The output is:
the quick brown fox
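If the phrase is not known in advance, or may contain characters that are special in a regex (the problem described in the question), one possible approach is to escape each word with re.escape and join the words with \s+. This is only a sketch under that assumption: phrase_pattern is a hypothetical helper name, and it uses search rather than match so the phrase can appear anywhere in the line.

```python
import re


def phrase_pattern(phrase):
    # Escape every word so regex metacharacters are taken literally,
    # then allow any run of whitespace between consecutive words.
    words = phrase.split()
    return re.compile(r'\s+'.join(re.escape(word) for word in words))


pattern = phrase_pattern("cost (in $) is 5.00")
line = "the  cost   (in $)  is\t5.00 today"
match = pattern.search(line)
print(match is not None)  # True
```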
You can use this regex:
(the\s+quick\s+brown\s+fox)
You can split the given string by white spaces and join them back by a white space, so that you can then compare it to the phrase you're looking for:
s = "the quick brown fox"
' '.join(s.split()) == "the quick brown fox" # returns True
For the general case:
Replace each run of whitespace characters with a single space.
Check whether the phrase is a substring of the line after the replacement.
import re
pattern = "your pattern"
def phrase_in_line(line):
    # collapse each run of whitespace into a single space
    line_without_spaces = re.sub(r'\s+', ' ', line)
    return pattern in line_without_spaces
As you later clarified, you need to match any line and any series of words. To illustrate, I added some more examples showing what the two similar proposed regexes do:
text = """the quick brown fox
another line with single and multiple spaces
some other instance with six words"""
Matching whole lines
The first one matches the whole line, iterating over the single lines
pattern1 = re.compile(r'((?:\w+)(?:\s+|$))+')
for i, line in enumerate(text.split('\n')):
    match = re.match(pattern1, line)
    print(i, match.group(0))
Its output is:
0 the quick brown fox
1 another line with single and multiple spaces
2 some other instance with six words
Matching single words
The second one matches single words and iterates over them one by one while iterating over the single lines:
pattern2 = re.compile(r'(\w+)(?:\s+|$)')
for i, line in enumerate(text.split('\n')):
    for m in re.finditer(pattern2, line):
        print(m.group(1))
    print()
Its output is:
the
quick
brown
fox
another
line
with
single
and
multiple
spaces
some
other
instance
with
six
words
I am trying to search a string, which I know is always a sentence, to find the three words that come before and three words that come after a comma. Is regex the right way to do this AND how do you account for the fact that sometimes you will be at the beginning and end of a sentence and there will not be 3 words?
Thanks for the help, trying to learn regex.
Hmm, that one's a bit long, but I guess it works:
>>> import re
>>> string = "The brown fox jumped over the red barn, and found the chickens."
>>> res = re.findall(r'(\b[a-z]+\b)?[^a-z]*(\b[a-z]+\b)?[^a-z]*(\b[a-z]+\b)?[^a-z]*,\s*(\b[a-z]+\b)?[^a-z]*(\b[a-z]+\b)?[^a-z]*(\b[a-z]+\b)?', string, re.IGNORECASE)
>>> res
[('the', 'red', 'barn', 'and', 'found', 'the')]
This will ignore numbers as well, for strings such as:
string = "The brown fox jumped over the red barn, and found 10 chickens."
To give:
[('the', 'red', 'barn', 'and', 'found', 'chickens')]
For things like:
string = "The brown fox jumped over the red barn, and fled."
It gives (findall returns an empty string for an optional group that did not match):
[('the', 'red', 'barn', 'and', 'fled', '')]
And same for words before the comma.
\b is a word boundary: it matches at the start or end of a word (a run of letters, digits, or underscores) without consuming any characters.
[a-z]+ is a character class covering the letters a to z. The + means the class is repeated one or more times, so it matches a whole word.
(\b[a-z]+\b) is a capture group (notice the parentheses) whose match is stored in the result. The question mark after it makes the group optional: it matches if a word is there and is skipped otherwise, which is how you still get results when there are fewer than 3 words before the comma.
[^a-z]* is a negated class (notice the caret just after the opening square bracket). It matches any character that is not a letter a through z. The asterisk * means zero or more occurrences.
, is a literal comma.
\s matches a whitespace character (space, tab, newline). The asterisk after it again means zero or more occurrences.
re.IGNORECASE, as it suggests, will make the match case-insensitive.
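The pieces explained above can be tried out with a small wrapper. This is a sketch, not part of the original answer: words_around_comma is a hypothetical helper that builds the same pattern from its repeated part and drops the empty strings that findall returns for optional groups that did not match.

```python
import re

# One optional word followed by any non-letters, as in the pattern above.
WORD = r'(\b[a-z]+\b)?[^a-z]*'
PATTERN = re.compile(
    WORD * 3 + r',\s*' + WORD * 2 + r'(\b[a-z]+\b)?',
    re.IGNORECASE,
)


def words_around_comma(sentence):
    # One tuple per comma found; unmatched optional groups come back
    # as '' from findall, so they are filtered out here.
    return [
        tuple(word for word in groups if word)
        for groups in PATTERN.findall(sentence)
    ]


print(words_around_comma("The brown fox jumped over the red barn, and fled."))
```

Note that after filtering you can no longer tell from the tuple alone which words came before the comma; keep the raw findall result if that matters.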
For your example:
sen = "The brown fox jumped over the red barn,and found the chickens"
result_left = sen.split(',')[0].split()[-3:]
# result_left: ['the', 'red', 'barn']
# for the right words
result_right = sen.split(',')[1].split()[:3]
# result_right: ['and', 'found', 'the']
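The split approach raises an IndexError when the sentence contains no comma. A possible variation, sketched here with a hypothetical helper name and an assumed default of three words, uses str.partition, which always returns three parts and only looks at the first comma.

```python
def words_around_first_comma(sentence, n=3):
    # partition returns (before, separator, after); the separator is ''
    # when the sentence has no comma at all.
    left, sep, right = sentence.partition(',')
    if not sep:
        return [], []
    # Slicing never fails, so fewer than n words on either side is fine.
    return left.split()[-n:], right.split()[:n]


sen = "The brown fox jumped over the red barn,and found the chickens"
print(words_around_first_comma(sen))
```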
I have a string in Python, say The quick #red fox jumps over the #lame brown dog.
I'm trying to replace each of the words that begin with # with the output of a function that takes the word as an argument.
def my_replace(match):
    return match + str(match.index('e'))
# Pseudo-code
string = "The quick #red fox jumps over the #lame brown dog."
string.replace('#%match', my_replace(match))
# Result
"The quick #red2 fox jumps over the #lame4 brown dog."
Is there a clever way to do this?
You can pass a function to re.sub. The function receives a match object as its argument; use .group() to extract the matched text as a string.
>>> import re
>>> def my_replace(match):
...     match = match.group()
...     return match + str(match.index('e'))
...
>>> string = "The quick #red fox jumps over the #lame brown dog."
>>> re.sub(r'#\w+', my_replace, string)
'The quick #red2 fox jumps over the #lame4 brown dog.'
I wasn't aware you could pass a function to re.sub() either. Riffing on @Janne Karila's answer to solve a problem I had, the approach works for multiple capture groups, too.
import re
def my_replace(match):
    match1 = match.group(1)
    match2 = match.group(2)
    match2 = match2.replace('#', '')
    # format the number (group 1) with the number of decimals given after '#' (group 2)
    return u"{0:0.{1}f}".format(float(match1), int(match2))
string = 'The first number is 14.2#1, and the second number is 50.6#4.'
# the dot is escaped so that it matches only a literal decimal point
result = re.sub(r'([0-9]+\.[0-9]+)(#[0-9]+)', my_replace, string)
print(result)
Output:
The first number is 14.2, and the second number is 50.6000.
This simple example requires all capture groups be present (no optional groups).
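If a group is optional, match.group(n) returns None when that group did not take part in the match, so the replacement function has to check for it. A minimal sketch, assuming the "#N" suffix may be missing; format_number and format_numbers are hypothetical names used only here.

```python
import re


def format_number(match):
    digits = match.group(2)  # None when the optional "#N" suffix is absent
    if digits is None:
        # No suffix: leave the number exactly as it was written.
        return match.group(1)
    return "{0:.{1}f}".format(float(match.group(1)), int(digits))


def format_numbers(text):
    # The non-capturing wrapper (?:...)? makes the whole "#N" part optional.
    return re.sub(r'([0-9]+\.[0-9]+)(?:#([0-9]+))?', format_number, text)


print(format_numbers('Values: 14.2#1, 50.6#4 and 7.25.'))
# Values: 14.2, 50.6000 and 7.25.
```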
Try:
import re
match = re.compile(r"#\w+")
items = re.findall(match, string)
for item in items:
    # my_replace here takes the matched word as a plain string
    string = string.replace(item, my_replace(item))
This will allow you to replace anything that starts with # with whatever the output of your function is.
I wasn't sure whether you need help with the function as well. Let me know if that's the case.
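One caveat with a findall-plus-replace loop: str.replace changes every occurrence each time it is called, so a tag that appears twice is processed twice. The sketch below compares that loop with a re.sub callback on a made-up sample string; the function names are hypothetical.

```python
import re


def my_replace(word):
    # Same idea as the question's function, taking the word as a string.
    return word + str(word.index('e'))


def replace_with_loop(text):
    # str.replace touches every occurrence on each pass through the loop.
    for item in re.findall(r'#\w+', text):
        text = text.replace(item, my_replace(item))
    return text


def replace_with_sub(text):
    # re.sub calls the function once per match and replaces only that match.
    return re.sub(r'#\w+', lambda m: my_replace(m.group()), text)


sample = "a #red fox and a #red hen"
print(replace_with_loop(sample))  # a #red22 fox and a #red22 hen
print(replace_with_sub(sample))   # a #red2 fox and a #red2 hen
```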
A short one with regex and reduce (in Python 3, reduce has to be imported from functools):
>>> import re
>>> from functools import reduce
>>> pat = r'#\w+'
>>> reduce(lambda s, m: s.replace(m, m + str(m.index('e'))), re.findall(pat, string), string)
'The quick #red2 fox jumps over the #lame4 brown dog.'
I'm using pyparsing to parse documents containing text in which the line ends vary in location. I need to write a parser expression that matches the text regardless of line break location. The following does NOT work:
from __future__ import print_function
from pyparsing import *
string_1 = """The quick brown
fox jumps over the lazy dog.
"""
string_2 = """The quick brown fox jumps
over the lazy dog.
"""
my_expr = Literal(string_1)
print(my_expr.searchString(string_1))
print(my_expr.searchString(string_2))
This results in the following being displayed on the console:
[['The quick brown \nfox jumps over the lazy dog.\n']]
[]
Since line breaks are included in ParserElement.DEFAULT_WHITE_CHARS, I don't understand why both strings do not match my expression. How do I create a parser element which DOES match text regardless of where the line breaks occur?
Your question is a good example of why I discourage people from defining literals with embedded whitespace, because this defeats pyparsing's built-in whitespace skipping. Pyparsing skips over whitespace between expressions. In your case, you are specifying only a single expression, a Literal comprising an entire string of words, including whitespace between them.
You can get whitespace skipped by breaking your string up into separate Literals (adding a string to a pyparsing expression automatically constructs a Literal from that string):
from pyparsing import *
my_expr = Literal("The") + "quick" + "brown" + "fox" + "jumps" + "over" + "the" + "lazy" + "dog"
string_1 = """The quick brown
fox jumps over the lazy dog.
"""
string_2 = """The quick brown fox jumps
over the lazy dog.
"""
for test in (string_1, string_2):
    print('-'*40)
    print(test)
    print(my_expr.parseString(test))
    print()
If you don't like typing all those separate quoted strings, you can have Python split the string up for you, map them to Literals, and feed the whole list to make up a pyparsing And:
my_expr = And(map(Literal, "The quick brown fox jumps over the lazy dog".split()))
If you want to preserve the original whitespace, wrap your expression in originalTextFor:
my_expr = originalTextFor(my_expr)
This may be a silly question but...
Say you have a sentence like:
The quick brown fox
Or you might get a sentence like:
The quick brown fox jumped over the lazy dog
The simple regexp (\w*) finds the first word "The" and puts it in a group.
For the first sentence, you could write (\w*)\s*(\w*)\s*(\w*)\s*(\w*)\s* to put each word in its own group, but that assumes you know the number of words in the sentence.
Is it possible to write a regular expression that puts each word in any arbitrary sentence into its own group? It would be nice if you could do something like (?:(\w*)\s*)* to have it group each instance of (\w*), but that doesn't work.
I am doing this in Python, and my use case is obviously a little more complex than "The quick brown fox", so it would be nifty if Regex could do this in one line, but if that's not possible then I assume the next best solution is to loop over all the matches using re.findall() or something similar.
Thanks for any insight you may have.
Edit: For completeness's sake here's my actual use case and how I solved it using your help. Thanks again.
>>> import re
>>> s = '1 0 5 test1 5 test2 5 test3 5 test4 5 test5'
>>> s = re.match(r'^\d+\s\d+\s?(.*)', s).group(1)
>>> print(s)
5 test1 5 test2 5 test3 5 test4 5 test5
>>> names = re.findall(r'\d+\s(\w+)', s)
>>> print(names)
['test1', 'test2', 'test3', 'test4', 'test5']
You can also use the findall function from the re module:
>>> import re
>>> re.findall(r"\w+", "The quick brown fox")
['The', 'quick', 'brown', 'fox']
I don't believe that it is possible. Regexes pair the captures with the parentheses in the given regular expression... if you only listed one group, like '((\w+)\s+){0,99}', then it would just repeatedly capture to the same first and second group... not create new groups for each match found.
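A quick way to see this behaviour, as a sketch using Python's re module: a capture group inside a repetition keeps only its last match, whereas finditer (or findall) yields one result per word. The variable names are chosen just for this illustration.

```python
import re

sentence = "The quick brown fox"

# The group is matched four times, but only the final capture survives.
repeated = re.match(r'(?:(\w+)\s*)*', sentence)
last_only = repeated.groups()

# Iterating over the matches gives every word instead.
all_words = [m.group(1) for m in re.finditer(r'(\w+)', sentence)]

print(last_only)   # ('fox',)
print(all_words)   # ['The', 'quick', 'brown', 'fox']
```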
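You could use str.split with an explicit separator, but that splits only on that exact string, not on a class of characters like whitespace. (Called with no arguments, str.split() does split on runs of whitespace.)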
Instead, you can use re.split, which can split on a regular expression, and give it '\s' to match any whitespace. You probably want it to match '\s+' to gather the whitespace greedily.
>>> import re
>>> help(re.split)
Help on function split in module re:
split(pattern, string, maxsplit=0)
Split the source string by the occurrences of the pattern,
returning a list containing the resulting substrings.
>>> re.split(r'\s+', 'The quick brown\t fox')
['The', 'quick', 'brown', 'fox']
>>>
Why use a regex when string.split does the same thing?
>>> "The quick brown fox".split()
['The', 'quick', 'brown', 'fox']
Regular expressions can't capture into an unknown number of groups. But there is hope in your case: look into the 'split' method, which should help.